MODULE 27
Test Construction and Use

Outline
Characteristics of High-Quality Classroom Tests
  • Validity
  • Reliability
  • Fairness and Equivalence
  • Practicality
Developing a Test Blueprint
Developing Test Items
  • Alternate-Choice (True/False) Items
  • Matching Exercises
  • Multiple-choice Items
  • Short-answer/Completion Items
  • Essay Tasks
Test Analysis and Revision
Summary
Key Concepts
Case Studies: Reflect and Evaluate

Learning Goals
1. Discuss the importance of validity, reliability, fairness/equivalence, and practicality in test construction.
2. Explain how a test blueprint is used to develop a good test.
3. Discuss the usefulness of each test item format.
4. Compare and contrast the scoring considerations for the five test item formats.
5. Describe the benefits of item analysis and revision.
CHARACTERISTICS OF HIGH-QUALITY CLASSROOM TESTS
Tests
are only one form of assessment that you may find useful in your
classroom. Because teachers use tests quite frequently, you will need
to become comfortable designing and evaluating classroom tests.
Writing good test items takes considerable time and practice.
Researchers have identified principles of high-quality test
construction (Gronlund, 2003; McMillan, 2007; National Research
Council, 2001). Despite this, countless teachers violate these
principles when they develop classroom tests. Poorly constructed
tests compromise the teacher’s ability to obtain an accurate
assessment of students’ knowledge and skills, while high-quality
assessments yield reliable and valid information about a student’s
performance (McMillan, 2007). Let’s examine four facets by which to
judge the quality of tests:
• validity,
• reliability,
• fairness and equivalence, and
• practicality.
Validity
Validity
typically is defined as the degree to which a test measures what it
is intended to measure. Validity is judged in relation to the purpose
for which the test is used. For example, if a test is given to assess
a social studies unit on American government, then the test questions
should focus on information covered in that unit. Teachers can
optimize the validity of a test by evaluating the test’s content
validity and creating an effective layout for the test items.
Content
validity.
Content
validity refers
to evidence that a test accurately represents a content domain—or
reflects what teachers have actually taught (McMillan, 2007). A test
cannot cover everything a student has learned, but it should provide
a representative sample of learning (Weller, 2001). A test would have
low content validity if it covered only a few ideas from lessons and
focused on extraneous information or if it covered one portion of
assigned readings but completely neglected other material that was
heavily emphasized in class. When developing a test, teachers can
consider several content validity questions (Nitko & Brookhart,
2007):
• Do the test questions emphasize the same things that were emphasized in day-to-day instruction?
• Do the test questions cover all levels of instructional objectives included in the lesson(s)?
• Does the weight assigned to each type of question reflect its relative value among all other types?
Test
layout.
The layout, or physical appearance, of a test also can affect the
validity of the test results. For example, imagine taking a test by
reading test items written on the board by the teacher or listening
to questions dictated by the teacher. The teacher’s handwriting or
students’ ability to see the board may hinder test performance,
lowering the validity of the test score. Writing test items on the
board or dictating problems also may put students with auditory or
vision problems at a particular disadvantage. The dictation process
can slow down test taking, making it very difficult for students to
go back to check over their answers a final time before submitting
them. Have you ever taken a test for which the instructions were
unclear? A test-taker’s inability to understand the instructions
could lower the level of performance, reducing the validity of the
test results.
Teachers
can follow these guidelines when deciding how to design a test’s
layout:
• In general, tests should be typed so that each student has a printed copy. Exceptions would be dictated spelling tests or testing of listening and comprehension skills.
• The test should begin with clear directions at the top of the first page or on a cover page. Typical directions include the number and format of items (e.g., multiple-choice, essay), any penalty for guessing, the amount of time allowed for completion of the test, and perhaps mention of test-taking strategies the teacher has emphasized (e.g., "Read each question completely before answering" or "Try to answer every question").
• Test items should be grouped by format (e.g., all multiple-choice items in one section, all true/false items in another section), and testing experts usually recommend arranging like items in order of increasing difficulty. Arranging items from easiest to hardest decreases student anxiety and increases performance (Tippets & Benson, 1989).
(Cross-reference: Validity as it applies to standardized tests, see page 534.)
Reliability. Consistency among scores on a test given twice indicates reliability.

Student:     Claire  Jee  Kristy  Doug  Nick  Taylor  Abby  Marcus
Monday:        87     83    74     77    93     88     68     70
Wednesday:     88     80    73     78    92     88     69     72
Reliability
Reliability
refers to the consistency of test results. If a teacher were to give
a test to students on Monday and then repeat that same test on
Wednesday (without additional instruction or practice in the
interim), students should perform consistently from one day to the
next. Let’s consider several factors that affect test reliability.
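Before turning to those factors, consider the Monday and Wednesday scores in the figure above. One common way to quantify this kind of test-retest consistency is to correlate the two sets of scores; the minimal Python sketch below does so for the eight scores shown, and a value near +1.0 indicates highly consistent performance.

    # Minimal sketch: test-retest consistency as a correlation between two
    # administrations of the same test. Scores are taken from the figure above.
    from math import sqrt

    monday    = [87, 83, 74, 77, 93, 88, 68, 70]
    wednesday = [88, 80, 73, 78, 92, 88, 69, 72]

    def pearson(x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
        sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
        return cov / (sd_x * sd_y)

    print(f"Test-retest correlation: {pearson(monday, wednesday):.2f}")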
Test
length and time provided for test taking.
Would you rather take a test that has 5 items or one that has 25
items? Tests with more items generally are more reliable than tests
with fewer items. However, longer assessments may not always be a
practical choice given other constraints, such as the amount of time
available for assessment. When you design tests, make sure that every
student who has learned the material well has sufficient time to
complete the assessment. Consider the average amount of time it will
take students to complete each question, and then adjust the number
of test questions accordingly. Table 27.1 provides some target time
requirements for different types of test items based on typical test
taking at the middle school or high school level. Keep in mind that
elementary school students require shorter tests than older students.
The time allotted during the school day for each subject might be
less than the average class period in middle or high school. So
elementary school students might have only 30 minutes to take a test
(requiring fewer items), while students in middle school or high
school might have a 50-minute period. Students in the elementary
grades also need shorter tests because they have shorter attention
spans and tire more quickly than older students.
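As a rough planning aid, the per-item times in Table 27.1 can be turned into an estimate of whether a draft test fits the available period. The sketch below is only an illustration: the item mix and the 50-minute period are assumed values, and the minutes per item are taken loosely from the ranges in Table 27.1.

    # Rough planning sketch: estimate total testing time for a draft item mix.
    # Minutes per item are taken loosely from the ranges in Table 27.1; the item
    # counts and the 50-minute period are assumed values for illustration.
    MINUTES_PER_ITEM = {
        "true/false": 0.4,                  # 20-30 seconds
        "multiple choice (factual)": 0.8,   # 40-60 seconds
        "short answer": 3.0,                # 2-4 minutes
        "short essay": 17.5,                # 15-20 minutes
    }

    draft_test = {
        "true/false": 10,
        "multiple choice (factual)": 20,
        "short answer": 5,
        "short essay": 1,
    }

    period = 50  # minutes available for the test
    total = sum(MINUTES_PER_ITEM[kind] * count for kind, count in draft_test.items())
    print(f"Estimated time: {total:.0f} of {period} minutes")
    if total > period:
        print("Too long for the period: trim items or swap formats.")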
Frequency
of testing.
The frequency of testing—or how often students are tested—also
affects reliability. The number and type of items included on tests
may be influenced by the amount of material that has been covered
since the last assessment was given. A review of research on
frequency of testing provides several conclusions (Dempster, 1991):
• Frequent testing encourages better retention of information and appears to be more effective than a comparable amount of time spent studying and reviewing material.
• Tests are more effective in promoting learning if students are tested soon after they have learned material and then retested on the same material later.
• The use of cumulative questions on tests is key to effective learning. Cumulative questions give students an opportunity to recall and apply information learned in previous units.
Objectivity
of scoring.
Objectivity
refers
to the degree to which two or more qualified evaluators agree on
what rating or score to assign to a student’s performance. Some
types of test items, such as multiple choice, true/false, and
matching, tend to be easier to score objectively than short-answer
items and essays. Objective scoring increases the reliability of the
test score. This does not mean that the more subjective item formats
should be eliminated, because they have their own advantages, as we
will see later in the module.
(Cross-reference: Reliability of test results, see page 535.)
TABLE 27.1  Time Requirements for Certain Assessment Tasks

Type of task                             Approximate time per item
True/false                               20–30 seconds
Multiple choice (factual)                40–60 seconds
One-word fill-in                         40–60 seconds
Multiple choice (complex)                70–90 seconds
Matching (5 stems/6 choices)             2–4 minutes
Short answer                             2–4 minutes
Multiple choice (with calculations)      2–5 minutes
Word problems (simple arithmetic)        5–10 minutes
Short essays                             15–20 minutes
Data analysis/graphing                   15–25 minutes
Drawing models/labeling                  20–30 minutes
Extended essays                          35–50 minutes

Source: Reprinted from Nitko & Brookhart, 2007, p. 119.
Fairness
and Equivalence
Fairness
is the degree to which all students have an equal opportunity to
learn material and demonstrate their knowledge and skill (Yung,
2001). Consider this multiple-choice question:
Which ball has the smallest diameter?
a. basketball
b. soccer ball
c. lacrosse ball
d. football
If
students from a lower-income background answer this geometry question
incorrectly, is it because they haven’t mastered the concept of
diameter or because they lack prior experience with some of the
items? Do you think girls also might lack experience with some of
these items? Females, as a group, do not score as high as males on
tests that reward mechanical or physical skills (Patterson, 1989) or
mathematical, scientific, or technical skills (Moore, 1989).
African-American, Latino, and Native-American students, as well as
students for whom English is a second language, do not as a group
perform as well as Anglos on formal tests (Garcia & Pearson,
1994). Asian Americans, who have a reputation as high achievers in
American society and who tend to score higher than Anglo students in
math, score lower on verbal measures (Tsang, 1989).
A
high-quality assessment should be free of bias,
or
a systematic error in test items that leads students from certain
subgroups (ethnic, socioeconomic, gender, religious, disability) to
perform at a disadvantage (Hargis, 2006). Tests that include items
containing bias reduce the validity of the test score. Additional
factors that tend to disadvantage students from diverse cultural
and linguistic backgrounds during testing include:
• speededness, or the inability to complete all items on a test as a result of prescribed time limitations (Mestre, 1984);
• test anxiety and testwiseness (Garcia, 1991; Rincon, 1980);
• differential interpretation of questions and foils (Garcia, 1991); and
• unfamiliar test conditions (Taylor, 1977).

Fairness. Tests should be free of bias that would lead certain subgroups to perform at a disadvantage.

(Cross-reference: Test fairness and test bias, see page 547.)
To
ensure that all students have an equal opportunity to demonstrate
their knowledge and skills, teachers may need to make individual
assessment accommodations with regard to format, response, setting,
timing, or scheduling (Elliott, McKevitt, & Kettler, 2002).
In
addition to assuring fairness among individuals within a particular
classroom, assessments must demonstrate fairness from one school year
to the next or even one class period to the next. If you decide to
use different questions on a test on a particular unit, you should
try to ensure equivalence from one exam to the next. Equivalence
means that students past and present, or students in the same course
but different class periods, are required to know and perform tasks
of similar (but not identical) complexity and difficulty in order to
earn the same grade (Nitko & Brookhart, 2007). This assumes that
the content or learning goals have not changed and that the analysis
of results from past assessments was satisfactory.
Practicality
Assessing
students is very important, but the time devoted to assessment should
not interfere with providing high-quality instruction. When creating
assessments, teachers need to consider issues of practicality—the
extent to which a particular form of assessment is economical and
efficient to create, administer, and score. For example, essay
questions tend to be less time consuming to construct than good
multiple-choice questions. However, multiple-choice questions can be
scored much more quickly and are easier to score objectively.
Performance tasks such as group projects or class presentations can
be very difficult to construct properly, but there are times when
these formats allow the teacher to better assess what students have
learned. When deciding what format to choose, consider whether:
• the format is relatively easy to construct and not too time consuming to grade,
• the time spent using the testing format could be better spent on teaching, and
• another format could meet assessment goals but be more efficient.
Think
about a particular test you might give in the grade level you intend
to teach. How will you ensure its validity, reliability, and
fairness? Evaluate your test’s practicality.
DEVELOPING A TEST BLUEPRINT
To
increase reliability, validity, fairness, and equivalence, try making
a blueprint prior to developing a test. A test
blueprint
is an assessment planning tool that describes the content the test
will cover and the way students are expected to demonstrate their
understanding of that content. When it is presented in a table
format, as shown in Table 27.2, it is called a table
of specifications.
On the table, the row headings (down the left margin) indicate major
topics that the assessment will cover. The column headings (across
the top) list the six classification levels of Bloom’s (1956)
taxonomy of cognitive objectives, which provide a framework for
composing tests. The first three categories (knowledge,
comprehension, and application) are often called lower-level
objectives, and the last three categories (analysis, synthesis, and
evaluation) are called higher-level objectives. Think of these six
categories as a comprehensive framework for considering different
cognitive goals that need to be met when planning for instruction.
Each cell within the table of specifications itemizes a specific
learning goal. These learning goals get more complex as you move from
left to right. In the far left column, the student might be asked to
define terms, while in the far right column the student is asked to
synthesize or evaluate information in a meaningful way. For each
cell, the teacher also needs to decide how many questions of each
type to ask, as well as how each item will be weighted (how many
points each item will be worth). The weight of each item should
reflect its value or importance (Gronlund, 2003; Nitko &
Brookhart, 2007).
When
planning a test, it is not necessary to cover all six levels of
Bloom’s taxonomy. It is more important that the test’s coverage
matches the learning goals and emphasizes the same concepts or skills
the teacher focused on in day-to-day instruction. Teaching is most
effective when lesson plans, teaching activities, and learning goals
are aligned and when all three are aligned with state standards.
Assessment is most effective when it matches learning goals and
teaching activities.
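A table of specifications can also be kept in a simple machine-readable form so that item counts and point weights are easy to total and compare against the intended emphasis. The sketch below is a hypothetical illustration; the topics echo Table 27.2, but the levels, item counts, and point values shown are assumed for the example.

    # Minimal sketch: a test blueprint (table of specifications) as data.
    # Each entry records a content topic, the cognitive level assessed, the
    # number of items planned, and points per item; all values are illustrative.
    blueprint = [
        {"topic": "Types of force",         "level": "Knowledge",   "items": 2, "points_each": 1},
        {"topic": "Two-dimensional forces", "level": "Application", "items": 6, "points_each": 2},
        {"topic": "Interaction of masses",  "level": "Synthesis",   "items": 1, "points_each": 5},
    ]

    total_items = sum(cell["items"] for cell in blueprint)
    total_points = sum(cell["items"] * cell["points_each"] for cell in blueprint)
    print(f"{total_items} items, {total_points} points planned")
    for cell in blueprint:
        share = 100 * cell["items"] * cell["points_each"] / total_points
        print(f"{cell['topic']:<26}{cell['level']:<13}{share:.0f}% of points")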
(Cross-reference: Bloom's taxonomy and its application to instructional planning, see page 360.)
(Cross-reference: Performance assessment, see page 498.)
TABLE 27.2  Sample Test Blueprint for a High School Science Unit

Column headings (major categories of the cognitive taxonomy): Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation. Each cell of the table lists a learning goal and the number of test items planned for it; the goals are grouped below by content topic.

1. Historical concepts of force
   - Identify the concepts of force and list the empirical support for each concept. (2 test items)
   - Classify each force as a vector or scalar quantity, given its description. (2 test items)
2. Types of force
   - Define each type of force and the term velocity. (2 test items)
   - Define the resultant two-dimensional force in terms of one-dimensional factors. (2 test items)
3. Two-dimensional forces
   - Find the x and y components of resultant forces on an object. (6 test items)
4. Three-dimensional forces
   - Define the resultant three-dimensional force in terms of one-dimensional factors. (1 test item)
   - Calculate the gravitational forces acting between two bodies. (8 test items)
5. Interaction of masses
   - Define the terms inertial mass, weight, active gravitational mass, and passive gravitational mass. (6 or 7 test items)
   - Develop a definition of mass that explains the difference between inertial mass and weight. (1 test item)

Source: Adapted from Kryspin & Feldhusen, 1974, p. 42.
Assuming that learning goals do not change, the same blueprint can be used to create multiple tests for use across class periods or school years.
Use
the sample table of specifications in Table 27.2 to create your own
test blueprint for a particular unit at the grade level you intend to
teach. What topics will your test cover, and how will you require
students to demonstrate their knowledge?
DEVELOPING TEST ITEMS
After
you have selected the content to be covered on the test, it is time
to consider the format of test items to use. Teachers have several
test formats available: alternate response, matching items, multiple
choice, short answer, and essay. Alternate response (e.g.,
true/false), matching items, and multiple choice are called
recognition
tasks
because they ask students to recognize correct information among
irrelevant or incorrect statements. These types of items are also
referred to as objective
testing
formats because they have one correct answer. Short answer/completion
(fill in the blanks) and essay items are
recall tasks,
requiring students to generate the correct answers from memory. These
types of items are also considered a subjective
testing
format, with the scoring more open to interpretation. Objective and
subjective formats differ in three important ways:
1.
The
amount of time to take the test.
While teachers may be able to ask 30 multiple-choice questions in a
50-minute class period, they may be able to ask only four essay
questions in the same amount of time.
2.
The
amount of time to score the test.
Scoring of objective formats is straightforward, needing only an
answer key with the correct responses. If teachers use optical
scanning sheets (“bubble sheets”) for objective test responses,
they can scan many sheets in a short amount of time. A middle school
or high school teacher who uses the same 50-item test for several
class periods can scan large batches of bubble sheets in minutes.
Because extended short answers or essay questions involve more
subjective judgments, they are more time-consuming to grade. Teachers
may choose to use essays for classes with fewer students.
3.
The
objectivity in grading.
Because extended short answers or essay questions are subjective,
teachers can improve their objectivity in scoring these formats by
using a rubric. A
rubric,
such as the example in Figure 27.1, is an assessment tool that
provides preset criteria for scoring student responses, making
grading simpler and more transparent. Rubrics ensure consistency of
grading across students or grading sessions (when teachers grade a
set of essays, stop, and return to grading).
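To make the idea concrete, the sketch below shows one hypothetical way a rubric such as the position-essay example in Figure 27.1 could be encoded and used to total a score. The criteria and maximum points follow that figure; the scoring function and the sample essay scores are assumptions for illustration.

    # Minimal sketch: a scoring rubric as preset criteria with maximum points.
    # Criteria and maximums follow the position-essay rubric in Figure 27.1;
    # the awarded points below are a made-up example for one student's essay.
    rubric = {
        "Format (typed, double-spaced)": 2,
        "States a clear position for or against": 3,
        "Appropriate level of detail describing stated position": 5,
        "Arguments to support response from resources": 5,
        "Grammar, spelling, punctuation, and clarity of writing": 5,
        "References in appropriate format for resources": 5,
    }

    def score_essay(awarded):
        """Total the awarded points, capping each criterion at its maximum."""
        return sum(min(points, rubric[criterion]) for criterion, points in awarded.items())

    one_essay = dict(rubric)  # start from full marks...
    one_essay["Grammar, spelling, punctuation, and clarity of writing"] = 3  # ...then deduct
    print(f"Score: {score_essay(one_essay)} / {sum(rubric.values())}")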
Teachers
should select the format that provides the most direct assessment of
the particular skill or learning outcome being evaluated (Gronlund,
2003). As you will see next, each type of item format has its own
unique characteristics. However, some general rules of item
construction apply across multiple formats. All test items should:
• measure the required skill or knowledge;
• focus on important, not trivial, subject area content;
• contain accurate information, including correct spelling;
• be clear, concise, and free of bias; and
• be written at an appropriate level of difficulty and an appropriate reading level.
Alternate-Choice
(True/False) Items
An
alternate-choice
item
presents a proposition that a student must judge and mark as either
true or false, yes or no, right or wrong.
Figure 27.1: Rubrics. Rubrics ensure consistency of grading when subjective test item formats are used.

Rubric for a Position Essay

Criteria                                                           Points possible   Points earned
Format (typed, double-spaced)                                             2
States a clear position for or against                                    3
Provides appropriate level of detail describing stated position           5
Includes arguments to support response from resources                     5
Grammar, spelling, punctuation and clarity of writing                     5
Includes references in appropriate format for resources                   5
TOTAL                                                                     25
BOX 27.1  Guidelines for Writing True/False Items
1.
Use short statements and simple vocabulary and sentence structure.
Consider the following true/false item: “The
true/false item is more subject to guessing but it should be used in
place of a multiple-choice item, if well constructed, when there are
a lack of plausible distractors.”
Long, complex sentences such as this one are confusing and therefore
more difficult to judge as true or false.
2.
Include only the central idea in each statement. Consider the
following true-false item: “The
true-false item, which is preferred by testing experts, is also
called an alternate-response item.”
This item has two ideas to evaluate: being favored by experts and
being called an alternative-response item.
3.
Use negative statements sparingly, and avoid double negatives. These
also can be confusing.
4.
Use precise wording so that the statement can unequivocally be judged
true or false.
5.
Write false statements that reflect common misconceptions held by
students who have not achieved learning goals.
6.
Avoid using statements that reproduce verbatim sentences from the
textbook or reading material. Students may be able to judge the item
true or false simply by memorizing and not understanding. This
reduces the validity of the student’s test score. Does the student
really
know the material?
7.
Statements of opinion should be attributed to some source (e.g.,
according to the textbook; according to the research on . . . ).
8.
When cause-effect is being assessed, use only true propositions. For
example, use “Exposure
to ultraviolet rays can cause skin cancer”
rather than “Exposure
to ultraviolet rays does not cause skin cancer.”
Evaluating that the positively worded statement is true is less
confusing than evaluating that the negatively worded statement is
false.
9.
Do not overqualify the statement in a way that gives away the
answer. Avoid using specific determiners (e.g., always,
never, all, none, only, usually, may,
and sometimes
tend to be true).
10.
Make true and false items of comparable length and make the same
number of true and false items.
11.
Randomly sort the items into a numbered sequence so that they are
less likely to appear in a repetitive, predictable pattern (e.g.,
avoid T F T F . . . or . . . TT FF TT FF).
Sources:
Ebel, 1979; Gronlund, 2003; Lindvall & Nitko, 1975.
Alternate-choice items are recognition tasks because students
only need to recognize whether the statement matches a fact that they
have in their memory. A true/false question might state:
An
equilateral triangle has three sides of equal length. T F
Teachers
also can design items that require multiple true/false responses. For
example:
Scientists
who study earthquakes have learned that:
1.
the surface of the earth is in constant motion due to forces inside
the planet. T F
2.
an earthquake is the vibrations produced by breaking rocks along a
fault line. T F
3.
the time and place of earthquakes are easy to predict. T F
Alternate-choice
questions are optimal when the subject matter lends itself to an
either-or response. They allow teachers to ask a large number of
questions in a short period of time (a practicality issue), making it
possible to cover a wide range of topics within the domain being
assessed. However, an obvious disadvantage of this format is that
students have a 50% chance of getting the answer correct simply by
guessing. Good alternate-choice questions are harder to write than you might expect; teachers can use the recommendations given in Box 27.1.
Matching
Exercises
A
matching
exercise
presents students with directions for matching a list of premises and
a list of responses. The student must match each premise with one of
the responses. Matching is a recognition task because the answers are
present in the test item. A simple matching exercise might look like
this:
Directions: In the left column are events that preceded the start of the Revolutionary War. For each event, choose the date it occurred from the right column, and place the letter identifying it on the line next to the event description.

Description of events (premise list):
______ 1. British troops fire on demonstrators in the Boston Massacre, killing five.
______ 2. British Parliament passes the Tea Act.
______ 3. Parliament passes the Stamp Act, sparking protests in the American colonies.
______ 4. Treaty of Paris ends French power in North America.

Dates (response list):
a. 1763
b. 1765
c. 1768
d. 1770
e. 1773
f. 1778
Matching
exercises are very useful for assessing a student’s ability to make
associations or see relationships between two things (e.g., words and
their definitions, individuals and their accomplishments, events in
history and dates). This format provides a space-saving and objective
way to assess learning goals. It is versatile in that words or
phrases can be matched to symbols or pictures (e.g., matching a
country name to the outline of that country on a map). Well-designed
matching exercises can assess students’ comprehension of concepts,
ideas, and principles. However, teachers often fall into the trap of
using matching only for memorized lists (such as names and dates) and
do not develop matching exercises that assess higher-level thinking.
To construct effective matching items, consider these guidelines:
• Clearly explain the intended basis for matching, as in the directions for the sample item above.
• Use short lists of responses and premises.
• Arrange the response list in a logical order. For example, in the sample item, dates are listed in chronological order.
• Identify premises with numbers and responses with letters.
• Construct items so that longer phrases appear in the premise list and shorter phrases appear in the response list.
• Create responses that are plausible items for each premise. A response that clearly does not fit any premise gives a hint to the correct answer.
• Avoid "perfect" one-to-one matching by including one or more responses that are incorrect choices or by using a response as the correct answer for more than one premise.
Multiple-choice
Items
Each
multiple-choice
item
contains a stem,
or introductory statement or question, and a list of choices, called
response
alternatives. Multiple-choice
items are recognition tasks because the correct answer is provided
among the choices. A typical multiple-choice question looks like
this:

What is the main topic of the reading selection Responding to Learners?
a. academic achievement
b. question and response techniques
c. managing student behavior

The
response alternatives include a keyed
alternative (the
correct answer) and distractors,
or incorrect alternatives. The example includes three choices: one
keyed alternative and two distractors. Other multiple-choice formats
may include a four-choice option (a, b, c, d) or five-choice option
(a, b, c, d, e). Box 27.2 presents a set of detailed guidelines for
developing a multiple-choice format that addresses content, style,
and tips for writing the stem and the choices.
Of
the item formats that serve as recognition tasks, multiple-choice
items are preferred by most assessment experts because this format
offers many advantages. The multiple-choice format can be used to
assess a great variety of learning goals, and the questions can be
structured to assess factual knowledge as well as higher-order
thinking.
BOX 27.2  Guidelines for Writing Multiple-choice Items
Content
concerns:
1.
Every item should reflect specific content and a single specific
mental behavior, as called for in test specifications.
2.
Base each item on important content to learn; avoid trivial
content.
3.
Use novel material to test higher-level learning. Paraphrase
textbook language or language used during instruction in a test item
to avoid testing for simple recall.
4.
Keep the content of each item independent from content of other items
on the test.
5.
Avoid overly specific and overly general content when writing
items.
6.
Avoid opinion-based items.
7.
Avoid trick items.
8.
Keep vocabulary simple for the group of students being tested.
Writing
the stem:
9.
Ensure that the wording of the stem is very clear.
10.
Include the central idea in the stem instead of in the response
alternatives. Minimize the amount of reading in each item. Instead of
repeating a phrase in each alternative, try to include it as part of
the stem.
11.
Avoid window dressing (excessive verbiage).
12.
Word the stem positively; avoid negatives such as NOT or EXCEPT. If
a negative word is used, use the word cautiously and always ensure
that the word appears capitalized and boldface.
Writing
the response alternatives:
13.
Offering three response alternatives is adequate.
14.
Make sure that only one of the alternatives is the right answer.
15.
Make all distractors plausible. Use typical student misconceptions as your distractors.
16. Phrase choices positively; avoid negatives such as NOT.
17.
Keep alternatives independent; they should not overlap.
18.
Keep alternatives homogeneous in content and grammatical structure.
For example, if one alternative is stated as a negative—“No
running in the hall”—all alternatives should be phrased as
negatives.
19.
Keep all alternatives roughly the same length. If the correct answer
is substantially longer or shorter than the distractors, this may
serve as a clue for test-takers.
20. "None of the above" should be avoided or used sparingly.
21. Avoid "All of the above."
22.
Vary the location of the correct answer in the list of alternatives
(e.g., the correct answer should not always be C).
23.
Place alternatives (A, B, C, D) in logical or numerical order. For
example, if a history question lists dates as the response
alternatives, they should be listed in chronological order.
24.
Avoid giving clues to the right answer, such as:
a.
specific determiners, including always,
never, completely,
and absolutely
b.
choices identical to or resembling words in the stem
c.
grammatical inconsistencies that cue the test-taker to the correct
choice
d.
obvious correct choice
e.
pairs or triplets of options that clue the test-taker to the correct
choice
f.
blatantly absurd, ridiculous options
Style concerns:
25. Edit and proof items.
26.
Use correct grammar, punctuation, capitalization, and spelling.
Source:
Adapted from Haladyna, Downing, & Rodriguez, 2002.
Because multiple-choice questions do not require writing, students
who are poor writers have a more equal playing field for
demonstrating their understanding of the content than they have when
answering essay questions. All students also have less chance to
guess the correct answer in a multiple-choice format than they do
with true/false items or a poorly written matching exercise. Also,
the distractor that a student incorrectly chooses can give the
teacher insight into the student’s degree of misunderstanding.
However,
because multiple-choice items are recognition tasks, this format does
not require the student to recall information independently.
Multiple-choice questions are not the best option for
assessing
writing skills, self-expression, or synthesis of ideas or in
situations where you want students to demonstrate their work (e.g.,
showing steps taken to solve a math problem). Also, poorly written
multiple-choice questions can be superficial, trivial, or limited to
factual knowledge.
The
guidelines for writing objective items, such as those given in Box
27.1 and Box 27.2, are especially important for ensuring the validity
of classroom tests. Poorly written test items can give students a
clue to the right answer. For example, specific
determiners—extraneous
clues to the answer such as always, never, all, none, only, usually,
may—can enable students who may not know the material well to
correctly answer some true/false or multiple-choice questions using
testwiseness. Test-wiseness
is an ability to use test-taking strategies, clues from poorly
written test items, and prior experience in test taking to improve
one’s score. Learned either informally or through direct
instruction, it improves with grade level, experience in test taking,
and motivation to do well on an assessment (Ebel & Frisbie, 1991;
Sarnacki, 1979; Slakter, Koehler, & Hampton, 1970).
Short-answer/Completion
Items
Short-answer/completion
items
come in three basic varieties. The question
variety
presents a direct question, and students are expected to supply a
short answer (usually one word or phrase), as shown here:

1. What is the capital of Kentucky? Frankfort
2. What does the symbol Ag represent on the periodic table? Silver
3. How many feet are in one yard? 3

The completion variety presents an incomplete sentence and requires students to fill in the blank, as in the next examples:
1. The capital of Kentucky is Frankfort.
2. 3(2 + 4) = 18
The
association
variety
(sometimes called the identification variety) presents a list of
terms, symbols, labels, and so on for which students have to recall
the corresponding answer, as shown below:

Element     Symbol
Barium      Ba
Calcium     Ca
Chlorine    Cl
Short-answer
items are relatively easy to construct. Teachers can use these
general guidelines to help them develop effective short-answer items:
• Use the question variety whenever possible, because it is the most straightforward short-answer design and is the preferred option of experts.
• Be sure the items are clear and concise so that a single correct answer is required.
• Put the blank toward the end of the line for easier readability.
• Limit blanks within a short-answer question to one or two.
• Specify the level of precision (a word, a phrase, a few sentences) expected in the answer so students understand how much to write in their response.
While
short-answer items typically assess students’ lower-level skills,
such as recall of facts, they also can be used to assess more complex
thinking skills if they are well designed. The short-answer format
lowers a student’s probability of getting an answer correct by
random guessing, a more likely scenario with alternate-choice and multiple-choice items.
Short-answer
items also are relatively easy to score objectively, especially when
the correct response is a one-word answer. Partial credit can be
awarded if a student provides a response that is close to the correct
answer but not 100% accurate. You occasionally may find that
students provide unanticipated answers. For example, if you ask “Who
discovered America?”, student responses might
include
“Christopher Columbus,” “Leif Erikson,” “the Vikings,” or
“explorers who sailed across the ocean.” You then would have to
make a subjective judgment about whether such answers should receive
full or partial credit. Because reliability decreases as scoring
becomes more subjective, teachers should use a scoring key when
deciding on partial credit to maintain scoring consistency from one
student to the next.
Essay
Tasks
Essay
tasks allow assessment of many cognitive skills that cannot be
assessed adequately, if at all, through more objective item formats.
Essay tasks can be classified into two types. Restricted
response essay
tasks limit the content of students’ answers as well as the form of
their responses. A restricted response task might state: List
the three parts of the memory system and provide a short statement
explaining how each part operates.
Extended
response essay
tasks require students to write essays in which they are free to
express their thoughts and ideas and to organize the information as
they see fit. With this format, there usually is no single correct
answer; rather the accuracy of the response becomes a matter of
degree. Teachers can use these guidelines to develop effective essay
questions:
• Cover the appropriate range of content and learning goals. One essay question may address one learning goal or several.
• Create essay questions that assess application of knowledge and higher-order thinking, not simply recall of facts.
• Make sure the complexity of the item is appropriate to the student's developmental level. Elementary school students might not be required to write lengthy essays in essay booklets, whereas middle school or high school students would be expected to write more detailed essays.
• Specify the purpose of the task, the length of the response, time limits, and evaluation criteria. For example, instead of an essay task that states "Discuss the advantages of single-sex classrooms," a high school teacher might phrase the task as: "You are addressing the school board. Provide three arguments in favor of single-sex classrooms." The revised essay question provides a purpose for the response and specifies the amount students need to write. Teachers also should specify how essays will be evaluated, for example, whether spelling and grammar count toward the grade and how students' opinions will be evaluated.
Whether
to use a restricted response format or an extended response format
depends on the intended purpose of the test item as well as
reliability and practicality issues. The restricted response format
narrows the focus of the assessment to a specific, well-defined
area. The level of specificity makes it more likely that students
will interpret the question as intended. This makes scoring easier,
because the range of possible responses is also restricted. Scoring
is more reliable because it is easier to be very clear about what
constitutes a correct answer. On the other hand, if you want to know
how a student organizes and synthesizes information, the narrowly
focused, restricted response format may not serve the assessment
purpose well. Extended response questions are suitable for assessing
students’ writing skill and/or subject matter knowledge. If your
learning goals involve skills such as organizing ideas, critically
evaluating a certain position or argument, communicating feelings, or
demonstrating creative writing skill, the extended response format
provides an opportunity for students to demonstrate these skills.
Because
extended responses are subjective, this format generally has poor
scoring reliability. Given the same essay response to evaluate,
several different teachers might award different scores, or the same
teacher might award different scores to student essays at the
beginning and end of a pile of tests. When the scores given are
inconsistent from one response to the next, the validity of the
assessment results is lessened. Also, teachers tend to evaluate the
essays of different students according to different criteria,
evaluating one essay in terms of its high level of creativity and
another more critically in terms of grammar and spelling. A
significant disadvantage is that grading extended essay responses is
a very time-consuming process, especially if the teacher takes the
time to provide detailed written feedback to help students improve
their work.
Restricted
and extended response essay items have special scoring considerations
unique to the essay format. Essay responses, especially extended
ones, tend to have poor scoring reliability and lower practicality.
As discussed earlier, a scoring rubric can help teachers score essay
answers more fairly and consistently. The following guidelines offer
additional methods for ensuring consistency:
• Use a set of anchor essays—student essays that the teacher selects as examples of performance at different levels of a scoring rubric (Moskal, 2003). For example, a teacher may have a representative A essay, B essay, and so on. Anchors increase reliability because they provide a comparison set for teachers as they score student responses. A set of anchor essays without student names can be used to illustrate the different levels of the scoring rubric to both students and parents.
• If an exam has more than one essay question, score all students on the first question before moving on to the next question. This method increases the consistency and uniformity of your scoring.
• Score subject matter content separately from other factors such as spelling or neatness.
• To increase fairness and eliminate bias, score essays anonymously by having students write their names on the back of the exam.
• Take the time to provide effective feedback on essays.
Think
about a particular unit on which your students might be tested in the
grade level you intend to teach. What test item formats would you
choose to use, and why? Would your choice of item formats be
different for a pretest and for a test given at the end of a unit?
TEST ANALYSIS AND REVISION
No
test is perfect. Good teachers evaluate the tests they use and make
necessary revisions to improve them. When using objective test
formats such as alternate response and multiple choice, teachers can
evaluate how well test items function by using item
analysis,
a process of collecting, summarizing, and using information from
student responses to make decisions about each test item. When
teachers use optical scanning sheets (“bubble sheets”) for test
responses, they can use a computer program not only to score the
responses, but also to generate an item analysis. Item analysis
provides two statistics that indicate how test items are functioning:
an item difficulty index and an item discrimination index.
The
item
difficulty index
reports the proportion of the group of test-takers who answered an
item correctly, ranging from 0 to 1. Items that are functioning
appropriately should have a moderate item difficulty index to
distinguish students who have grasped the material from those who
have not. This increases the validity of the test score. As a rule of
thumb, a moderate item difficulty can range from
.3
to .7. However, the optimal item difficulty level must take guessing
into account. For example, with a four-choice (a, b, c, d)
multiple-choice item, the chance of guessing is 25%. The optimal
difficulty level for this type of item is the midpoint between
chance (25%) and 100%, or 62.5%. So item difficulties for a
four-choice multiple-choice item should be close to .625.
Item
difficulty indexes that are very low (e.g., 0–.3) indicate that
very few students answered correctly. This information can identify
particular concepts that need to be retaught, provide clues about the
strengths and weaknesses of instruction, or indicate test items that
are poorly written and need to be revised or removed. Item difficulty
indexes that are very high (e.g., .8–.9) indicate that the majority
of students answered the items correctly, suggesting that the items
were too easy. While we want students to perform well, items that are
too easy do not discriminate between the students who know the
material well and those who do not, which is one purpose of assessing
student performance.
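The difficulty index and the guess-adjusted optimal level described above involve only simple arithmetic. The sketch below is a minimal illustration; the student response data are hypothetical.

    # Minimal sketch: item difficulty index and the guess-adjusted optimal level.
    # Each value in responses records whether a student answered the item
    # correctly; the data are hypothetical.
    responses = [True, True, False, True, True, False, False, True]

    def difficulty_index(correct_flags):
        """Proportion of test-takers who answered the item correctly (0 to 1)."""
        return sum(correct_flags) / len(correct_flags)

    def optimal_difficulty(num_choices):
        """Midpoint between the chance of guessing and 1.0 (e.g., .625 for 4 choices)."""
        chance = 1 / num_choices
        return (chance + 1.0) / 2

    p = difficulty_index(responses)
    print(f"Item difficulty: {p:.2f}; optimal for a 4-choice item: {optimal_difficulty(4):.3f}")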
Item
discrimination indexes add this crucial piece of information. An item
discrimination index describes
the extent to which a particular test item differentiates
high-scoring students from low-scoring students. It is calculated as
the difference between the proportion of the upper group (highest
scorers) who answered a particular item correctly and the proportion
of the lower group (lowest scorers) who answered that same item
correctly. The resulting index ranges from –1 to +1. If a test is
well constructed, we would expect all test items to be positively
discriminating, meaning that those students in the highest scoring
group get the items correct while those in the lowest scoring group
get the items wrong. Test items with low (below .4), zero, or
negative discrimination indexes reduce a test score’s validity and
should be rewritten or replaced. A low item discrimination index
indicates that the item cannot accurately discriminate between
students who know the material and those who don’t. An item
discrimination index of zero indicates that the item cannot
discriminate at
all
between high scorers and low scorers. A test with many zero
discriminations does not provide a valid measure of student
achievement. Items with negative discrimination indexes indicate that
lower-scoring students tended to answer the items correctly while
higher-scoring students tended to get them wrong. If a test contains
items with negative discriminations, the total score on the exam will
not provide useful information.
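The discrimination index is simply a difference between two proportions. The minimal sketch below uses hypothetical upper-group and lower-group results.

    # Minimal sketch: item discrimination index as the difference between the
    # proportion correct in the upper-scoring group and the lower-scoring group.
    # The group results below are hypothetical.
    def discrimination_index(upper_correct, upper_total, lower_correct, lower_total):
        """Ranges from -1 to +1; positive values mean high scorers did better."""
        return upper_correct / upper_total - lower_correct / lower_total

    # Of 10 top scorers, 9 answered the item correctly; of 10 bottom scorers, 4 did.
    d = discrimination_index(9, 10, 4, 10)
    print(f"Discrimination index: {d:+.2f}")  # +0.50, a positively discriminating item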
When
item analyses suggest that particular test items did not function as
expected, the source of the problem must be investigated. The item
analysis may indicate that the problem stems from:
• the item itself. For example, multiple-choice items may have poorly functioning distractors, ambiguous alternatives, questions that invite random guessing, or items that have been keyed incorrectly.
• student performance. Students may have misread the item and misunderstood it. When developing a new test item, the teacher may think it is worded clearly and has one distinctly correct answer. However, item analysis might reveal that students interpreted that test item differently than intended.
• teacher performance. Low item difficulties, which indicate that students did not grasp the material, may suggest that the teacher's performance needs to be improved. Perhaps concepts needed further clarification or the teacher needs to consider a different approach to presenting the material.
The
process of discarding certain items that did not work well, revising
other items, and testing a few new items on each test eventually
leads to a much higher quality test (Nitko & Brookhart, 2007).
Once “good” items have been selected, teachers can use software
to store test items in a computer file so they can select a subset
of test items, make revisions, assemble tests, and print out tests
for use in the classroom. Certain software products even allow the
teacher to sort items according to their alignment with curriculum
standards. Computer applications vary in quality, cost,
user-friendliness, and amount of training required for their use.
An
item analysis shows that three of your test items have very low item
difficulties, and you plan to revise these before you use the test
next time. Because these poorly functioning items affect students’
test scores, what can you do to improve the validity of your current
students’ test scores?
Summary
Discuss the importance of validity, reliability, fairness/equivalence, and practicality in test construction.
Classroom
tests should be evaluated based on their validity, reliability,
fairness/equivalence, and practicality. Teachers must consider how
well the test measures what it is supposed to measure (validity), how
consistent the results are (reliability), the degree to which all
students have an equal opportunity to learn and demonstrate their
knowledge and skill (fairness/equivalence), and how economical and
efficient the test is to create, administer, and score
(practicality).
Explain how a test blueprint is used to develop a good test.
A
test blueprint is an assessment planning tool that describes the
content the test will cover and the way students are expected to
demonstrate their understanding of that content. A test blueprint, or
table of specifications, helps teachers develop good tests because
it matches the test to instructional objectives and actual
instruction. Test blueprints take into consideration the importance
of each learning goal, the content to be assessed, the material that
was emphasized during instruction, and the amount of time available
for students to complete the test.
Discuss
the usefulness of each test item format.
Alternate-choice
items allow teachers to cover a wide range of topics by asking a
large number of questions in a short period of time. Matching
exercises can be useful for assessing a student’s ability to make
associations or see relationships between two things. Multiple-choice
items are preferred by assessment experts because they focus on
reading and thinking but do not
require
writing, give teachers insight into students’ degree of
misunderstanding, and can be used to assess a variety of learning
goals. Short-answer questions are relatively easy to construct, can
assess both lower-order and higher-order thinking skills, and
minimize the chance that students will answer questions correctly by
randomly guessing. Essay questions can provide an effective
assessment of many cognitive skills that cannot be assessed
adequately, if at all, with more objective item formats.
Compare and contrast the scoring considerations for the five test item formats.
Objective
formats such as alternate choice, matching, and multiple choice have
one right answer and tend to be relatively quick and easy to score.
Short-answer/completion questions, if well designed, also can be
relatively easy to score as long as they are written clearly and
require a very specific answer. Essay questions, especially extended
essay formats, tend to have poor scoring reliability and lower
practicality because scoring is subjective and can be
time-consuming. Scoring rubrics allow teachers to score essay answers
more fairly and consistently.
Describe
the benefits of item analysis and revision.
Item
analysis determines whether a test item functions as intended,
indicates areas where students need clarification of concepts, and
points out where the curriculum needs to be improved in future
presentations of the material. The process of discarding certain
items that did not work well, revising other items, and testing a few
new items on each test eventually leads to a much higher quality
test.
Key Concepts
alternate-choice item, content validity, distractors, equivalence, extended response essay, fairness, item analysis, item difficulty index, item discrimination index, matching exercise, multiple-choice item, objective testing, objectivity, practicality, recall tasks, recognition tasks, reliability, response alternatives, restricted response essay, rubric, short-answer/completion items, specific determiners, stem, subjective testing, table of specifications, test blueprint, validity
Case Studies: Reflect and Evaluate
Early
Childhood: “The
Zoo”
These
questions refer to the case study on page 458.
1.
Sanjay and Vivian do not do any paper-and-pencil testing in this scenario. What factors should they consider when deciding whether to
sce-nario. What factors should they consider when deciding whether to
use traditional tests as a form of assessment?
2.
If Vivian and Sanjay choose to use tests or quizzes with the
children, what steps should they take to ensure the validity of their
results?
3.
What issues could potentially interfere with the reliability of test
results among preschoolers?
4.
Most of the preschoolers in Vivian and Sanjay’s classroom do not
yet know how to read independently. How would this impact test
construction and use for this age group?
5.
Given your response to question 4, how might Vivian and Sanjay assess
a specific set of academic skills (e.g., letter or number
recognition) with their students in a systematic way?
6.
The lab school is located in a large city and has a diverse group of
students. How might issues of fairness come into play when designing
assessments for use with these students?
Elementary
School: “Writing
Wizards”
These
questions refer to the case study on page 460.
1.
Brigita provides an answer key for the Grammar Slammer activity
(quiz) and reviews it with the class.
How
does the use of such a key impact the level of objectivity in scoring
responses?
2.
Brigita uses a matching exercise (quiz) to evaluate students’
understanding of their weekly vocabulary words. What are the
advantages and disadvantages of this form of assessment?
3.
If Brigita decides she wants to vary her quiz format by using
multiple-choice questions instead, what factors should she keep in mind
in order to write good multiple-choice questions?
4.
Brigita uses a combination of traditional tests and applied writing
activities to assess her students’ writing skills. Are there other
subject areas in a fourth-grade classroom in which tests would be a
useful assessment choice? Explain.
5.
Imagine that Brigita will be giving a social studies test and wants
to incorporate writing skills as part of this assessment. What are
the advantages and disadvantages of the essay format?
6.
Based on what you read in the module, what advice would you give
Brigita about how to score the responses on the social studies test
referred to in question 5?
Middle
School: “Assessment:
Cafeteria Style”
These
questions refer to the case study on page 462.
1.
From Ida’s perspective, how do the development, implementation, and
grading of the multiple-choice test rate in terms of practicality?
2.
Do 50 multiple-choice questions seem an appropriate number for a
middle school exam in a
50-minute
class period? Why or why not?
3.
What are the advantages of using multiple-choice questions rather
than one of the other question formats available (alternate choice,
matching, short answer, or essay)? Would your answer vary for
different subjects?
4.
What are some limitations of using only multiple-choice questions to
test students’ understanding of course content?
5.
Ida provided a rubric to help her students better understand what was
expected from them on the project option. What could she have done to
clarify her expectations for those students who choose to take the
exam?
6.
If you were the one writing the questions for Ida’s exam, what are
some guidelines you would follow to make sure the questions are well
constructed?
High
School: “Innovative
Assessment Strategies”
These
questions refer to the case study on page 464.
1.
What are some advantages tests have that might explain why so many
teachers at Jefferson High rely on them as a primary means of
assessment?
2.
In the New Hampshire humanities course described in the memo,
students are asked to design their own test as part of their final
project. How could these students use a test blueprint to develop a
good test?
3.
What criteria should the New Hampshire teacher use to evaluate the
quality of the tests designed by students?
4.
Imagine that a New Hampshire teacher gave a final exam using a
combination of the test questions created by students. What could the
teacher do to determine how the questions actually functioned?
5.
The California teacher used a “dream home” project to assess his
students’ conceptual understanding of area relationships. Is it
possible to assess this level of understanding by using
multiple-choice questions on a test? Explain your answer.
6.
Imagine that the Rhode Island social studies teacher wants students
to pay close attention to one another’s oral history presentations,
so she announces that students will be given a test on the material.
If she wants to test basic recall of facts and wants the test to be
quick and easy to grade, which item formats would you suggest she use
on her test? Explain.