Writing Multiple Choice Test Questions

SUMMARY

This is a tutorial on creating multiple choice questions, framed by Haladyna’s heuristics for test design and Anderson & Krathwohl’s update to Bloom’s taxonomy. My interest in computer-gradable test questions is to support teaching and learning rather than high-stakes examination. Some of the design heuristics are probably different for this case. For example, which is the more desirable attribute for a test question:

  (a) defensibility (you can defend its fairness and appropriateness to a critic) or
  (b) potential to help a student gain insight?

In high-stakes exams, (a) [defensibility] is clearly more important, but as a support for learning, I’d rather have (b) [support for insight].

This tutorial’s examples are from software engineering, but from my perspective as someone who has also taught psychology and law, I think the ideas are applicable across many disciplines.

The tutorial’s advice and examples specifically target three projects:


STANDARDS SPECIFIC TO THE BBST AND OPEN CERTIFICATION QUESTIONS

1. Consider a question with the following structure:

Choose the answer:

  (a) First option
  (b) Second option

The typical way we will present this question is:

Choose the answer:

  (a) First option
  (b) Second option
  (c) Both (a) and (b)
  (d) Neither (a) nor (b)

  • If the correct answer is (c) then the examinee will receive 25% credit for selecting only (a) or only (b).

2. Consider a question with the following structure:

Choose the answer:

  (a) First option
  (b) Second option
  (c) Third option

The typical way we will present this question is:

Choose the answer:

  (a) First option
  (b) Second option
  (c) Third option
  (d) (a) and (b)
  (e) (a) and (c)
  (f) (b) and (c)
  (g) (a) and (b) and (c)

  • If the correct answer is (d), the examinee will receive 25% credit for selecting only (a) or only (b). Similarly for (e) and (f).
  • If the correct answer is (g) (all of the above), the examinee will receive 25% credit for selecting (d) or (e) or (f) but nothing for the other choices.
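
The three-option presentation above is mechanical enough to generate. The following is a minimal sketch, not part of any actual BBST or Open Certification tooling; the function name `expand_choices` is hypothetical. It appends every combination of two or more base options, which reproduces the seven-choice layout for three base options. It does not handle the "Both/Neither" wording of the two-option case or the seven-choice cap that applies to four-option questions.

```python
from itertools import combinations
from string import ascii_lowercase


def expand_choices(base_options):
    """Return (label, text) pairs: the base options, then every
    combination of two or more of them, labeled (a), (b), (c), ..."""
    labels = ascii_lowercase[:len(base_options)]
    choices = list(zip(labels, base_options))
    next_index = len(base_options)
    for size in range(2, len(base_options) + 1):
        for combo in combinations(labels, size):
            text = " and ".join(f"({letter})" for letter in combo)
            choices.append((ascii_lowercase[next_index], text))
            next_index += 1
    return choices


for label, text in expand_choices(["First option", "Second option", "Third option"]):
    print(f"({label}) {text}")
```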

3. Consider a question with the following structure:

Choose the answer:

  (a) First option
  (b) Second option
  (c) Third option
  (d) Fourth option

The typical ways we might present this question are:

Choose the answer:

  (a) First option
  (b) Second option
  (c) Third option
  (d) Fourth option

OR

Choose the answer:

  (a) First option
  (b) Second option
  (c) Third option
  (d) Fourth option
  (e) (a) and (c)
  (f) (a) and (b) and (d)
  (g) (a) and (b) and (c) and (d)

There will be a maximum of 7 choices.

The three combination choices can be any combination of two, three or four of the first four answers.

  • If the correct answer is a pair, like (e), the examinee will receive 25% credit for selecting only one member of that pair and nothing for selecting a combination that includes both members of the pair but also includes an incorrect choice.
  • If the correct answer is (f) (three of the four), the examinee will receive 25% credit for selecting a correct pair (if (a) and (b) and (d) are all correct, then any two of them get 25%) but nothing for selecting only one of the three or selecting a choice that includes two or three correct but also includes an incorrect choice.
  • If the correct answer is (g) (all correct), the examinee will receive a 25% credit for selecting a correct triple.
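
One way to read the partial-credit rules in this section is as a single rule: full credit for the exact answer, 25% for a selection that is correct as far as it goes but omits exactly one correct option, and nothing otherwise. The sketch below encodes that reading. It is my generalization of the bullets, not an official grading algorithm, and the function name `credit` is hypothetical. Answers are represented as sets of base-option letters, so choice (d) "(a) and (b)" becomes {"a", "b"} and "Neither" becomes the empty set.

```python
def credit(correct, selected):
    """Score a selection against the correct answer.

    Both arguments are sets of base-option letters.
    Returns 1.0, 0.25, or 0.0 under the assumed rule described above.
    """
    correct, selected = frozenset(correct), frozenset(selected)
    if selected == correct:
        return 1.0
    # Partial credit: non-empty, contains no incorrect option,
    # and is missing exactly one of the correct options.
    one_short = len(selected) == len(correct) - 1
    if selected and selected < correct and one_short:
        return 0.25
    return 0.0


# Correct answer is the triple (a), (b), (d) from the four-option example.
print(credit({"a", "b", "d"}, {"a", "b"}))            # a correct pair
print(credit({"a", "b", "d"}, {"a"}))                 # only one of the three
print(credit({"a", "b", "d"}, {"a", "b", "c", "d"}))  # includes an incorrect choice
```

Under this reading the three calls return 0.25, 0.0 and 0.0, matching the bullet for a three-of-four correct answer.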

DEFINITIONS AND EXAMPLES

Definitions

Here are a few terms commonly used when discussing the design of multiple choice questions. See the Reference Examples, below.

  • Test: In this article, the word “test” is ambiguous. Sometimes we mean a software test (an experiment that can expose problems in a computer program) and sometimes an academic test (a question that can expose problems in someone’s knowledge). In these definitions, “test” means “academic test.”

  • Test item: a test item is a single test question. It might be a multiple choice test question or an essay test question (or whatever).
  • Content item: a content item is a single piece of content, such as a fact or a rule, something you can test on.
  • Stem: The opening part of the question is called the stem. For example, “Which is the best definition of the testing strategy in a testing project? ” is Reference Example B’s stem.
  • Distractor: An incorrect answer. In Reference Example B, (b) and (c) are distractors.
  • Correct choice: The correct answer for Reference Example B is (a) “The plan for applying resources and selecting techniques to achieve the testing mission.”
  • The Question format: The stem is a complete sentence and asks a question that is answered by the correct choice and the distractors. Reference Example A has this format.
  • The Best Answer format: The stem asks a complete question. Most or all of the distractors and the correct choice are correct to some degree, but one of them is stronger than the others. In Reference Example B, all three answers are plausible but in the BBST course, given the BBST lectures, (a) is the best.
  • The Incomplete Stem format: The stem is an incomplete sentence that the correct choice and distractors complete. Reference Example C has this format.
  • Complex formats: In a complex-format question, the alternatives include simple answers and combinations of these answers. In Reference Example A, the examinee can choose (a) “We can never be certain that the program is bug free” or (d), which says that both (a) and (b) are true, or (g), which says that all of the simple answers (a, b and c) are true.
  • Learning unit: A learning unit typically includes a limited set of content that shares a common theme or purpose, plus learning support materials such as a study guide, test items, an explicit set of learning objectives, a lesson plan, readings, lecture notes or video, etc.
  • High-stakes test: A test is high-stakes if there are significant benefits for passing the test or significant costs of failing it.

The Reference Examples

For each of the following, choose one answer.

A. What are some important consequences of the impossibility of complete testing?

  (a) We can never be certain that the program is bug free.
  (b) We have no definite stopping point for testing, which makes it easier for some managers to argue for very little testing.
  (c) We have no easy answer for what testing tasks should always be required, because every task takes time that could be spent on other high importance tasks.
  (d) (a) and (b)
  (e) (a) and (c)
  (f) (b) and (c)
  (g) All of the above

B. Which is the best definition of the testing strategy in a testing project?

  (a) The plan for applying resources and selecting techniques to achieve the testing mission.
  (b) The plan for applying resources and selecting techniques to assure quality.
  (c) The guiding plan for finding bugs.

C. Complete statement coverage means …

  (a) That you have tested every statement in the program.
  (b) That you have tested every statement and every branch in the program.
  (c) That you have tested every IF statement in the program.
  (d) That you have tested every combination of values of IF statements in the program.

D. The key difference between black box testing and behavioral testing is that:

  (a) The test designer can use knowledge of the program’s internals to develop a black box test, but cannot use that knowledge in the design of a behavioral test because the behavioral test is concerned with behavior, not internals.
  (b) The test designer can use knowledge of the program’s internals to develop a behavioral test, but cannot use that knowledge in the design of a black box test because the designer cannot rely on knowledge of the internals of the black box (the program).
  (c) The behavioral test is focused on program behavior whereas the black box test is concerned with system capability.
  (d) (a) and (b)
  (e) (a) and (c)
  (f) (b) and (c)
  (g) (a) and (b) and (c)

E. What is the significance of the difference between black box and glass box tests?

  (a) Black box tests cannot be as powerful as glass box tests because the tester doesn’t know what issues in the code to look for.
  (b) Black box tests are typically better suited to measure the software against the expectations of the user, whereas glass box tests measure the program against the expectations of the programmer who wrote it.
  (c) Glass box tests focus on the internals of the program whereas black box tests focus on the externally visible behavior.

ITEM-WRITING HEURISTICS

Several papers on the web organize their discussion of multiple choice tests around a researched set of advice from Haladyna, Downing & Rodriguez or the updated list from Haladyna (2004). I’ll do that too, tying their advice back to our needs for software testing.

Content Guidelines
1. Every item should reflect specific content and a single specific cognitive process, as called for in the test specifications (table of specifications, two-way grid, test blueprint).
2. Base each item on important content to learn; avoid trivial content.
3. Use novel material to measure understanding and the application of knowledge and skills.
4. Keep the content of an item independent from content of other items on the test.
5. Avoid overspecific and overgeneral content.
6. Avoid opinion-based items.
7. Avoid trick items.

Style and Format Concerns
8. Format items vertically instead of horizontally.
9. Edit items for clarity.
10. Edit items for correct grammar, punctuation, capitalization and spelling.
11. Simplify vocabulary so that reading comprehension does not interfere with testing the content intended.
12. Minimize reading time. Avoid excessive verbiage.
13. Proofread each item.

Writing the Stem
14. Make the directions as clear as possible.
15. Make the stem as brief as possible.
16. Place the main idea of the item in the stem, not in the choices.
17. Avoid irrelevant information (window dressing).
18. Avoid negative words in the stem.

Writing Options
19. Develop as many effective options as you can, but two or three may be sufficient.
20. Vary the location of the right answer according to the number of options. Assign the position of the right answer randomly.
21. Place options in logical or numerical order.
22. Keep options independent; choices should not be overlapping.
23. Keep the options homogeneous in content and grammatical structure.
24. Keep the length of options about the same.
25. “None of the above” should be used sparingly.
26. Avoid using “all of the above.”
27. Avoid negative words such as not or except.
28. Avoid options that give clues to the right answer.
29. Make all distractors plausible.
30. Use typical errors of students when you write distractors.
31. Use humor if it is compatible with the teacher; avoid humor in a high-stakes test.

Now to apply those to our situation.

CONTENT GUIDELINES

1. Every item should reflect specific content and a single specific cognitive process, as called for in the test specifications (table of specifications, two-way grid, test blueprint).

Here are the learning objectives from the AST Foundations course. Note the grid (the table), which lists the level of knowledge and skills in the course content and defines the level of knowledge we hope the learner will achieve. For discussions of level of knowledge, see my blog entries on Bloom’s taxonomy [1] [2] [3]:

No. | Learning Objectives of the AST Foundations Course | Anderson / Krathwohl level
1 | Familiar with basic terminology and how it will be used in the BBST courses | Understand
2 | Aware of honest and rational controversy over definitions of common concepts and terms in the field | Understand
3 | Understand there are legitimately different missions for a testing effort. Understand the argument that selection of mission depends on contextual factors. Able to evaluate relatively simple situations that exhibit strongly different contexts in terms of their implication for testing strategies. | Understand, Simple evaluation
4 | Understand the concept of oracles well enough to apply multiple oracle heuristics to their own work and explain what they are doing and why | Understand and apply
5 | Understand that complete testing is impossible. Improve ability to estimate and explain the size of a testing problem. | Understand, rudimentary application
6 | Familiarize students with the concept of measurement dysfunction | Understand
7 | Improve students’ ability to adjust their focus from narrow technical problems (such as analysis of a single function or parameter) through broader, context-rich problems | Analyze
8 | Improve online study skills, such as learning more from video lectures and associated readings | Apply
9 | Improve online course participation skills, including online discussion and working together online in groups | Apply
10 | Increase student comfort with formative assessment (assessment done to help students take their own inventory, think and learn rather than to pass or fail the students) | Apply

For each of these objectives, we could list the items that we want students to learn. For example:

  • list the terms that students should be able to define
  • list the divergent definitions that students should be aware of
  • list the online course participation skills that students should develop or improve.

We could create multiple choice tests for some of these:

  • We could check whether students could recognize a term’s definition.
  • We could check whether students could recognize some aspect of an online study skill.

But there are elements in the list that aren’t easy to assess with a multiple choice test. For example, how can you tell whether someone works well with other students by asking them multiple choice questions? To assess that, you should watch how they work in groups, not read multiple-choice answers.

Now, back to Haladyna’s first guideline:

  • Use an appropriate type of test for each content item. Multiple choice is good for some, but not all.
  • If you use a multiple choice test, each test item (each question) should focus on a single content item. That might be a complex item, such as a rule or a relationship or a model, but it should be something that you and the student would consider to be one thing. A question spread across multiple issues is confusing in ways that have little to do with the content being tested.
  • Design the test item to assess the material at the right level (see the grid, above). For example, if you are trying to learn whether someone can use a model to evaluate a situation, you should ask a question that requires the examinee to apply the model, not one that just asks whether she can remember the model.
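
If the question bank is kept in machine-readable form, the last two points can be checked mechanically. The sketch below is purely illustrative: the item IDs, the tagging scheme and the function name `audit` are hypothetical, and only three rows of the objectives grid are copied in. It flags items whose cognitive level does not match the level the grid calls for, and lists objectives that no suitable item covers.

```python
# Target levels for a few objectives, copied from the grid above.
TARGET_LEVEL = {
    1: {"Understand"},
    4: {"Understand", "Apply"},
    7: {"Analyze"},
}

# Hypothetical items: (item id, objective number, level the item exercises).
ITEMS = [
    ("Q1", 1, "Understand"),
    ("Q2", 4, "Remember"),   # only asks the student to recall the model
    ("Q3", 4, "Apply"),
]


def audit(items, target_level):
    """Return (ids of items at the wrong level, objectives with no matching item)."""
    mismatched = [item_id for item_id, objective, level in items
                  if level not in target_level.get(objective, set())]
    covered = {objective for _, objective, level in items
               if level in target_level.get(objective, set())}
    uncovered = sorted(set(target_level) - covered)
    return mismatched, uncovered


print(audit(ITEMS, TARGET_LEVEL))  # (['Q2'], [7])
```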

When we work with a self-contained learning unit, such as the individual AST BBST courses and the engineering ethics units, it should be possible to list most of the items that students should learn and the associated cognitive level.

However, for the Open Certification exam, the listing task is much more difficult because it is fair game to ask about any of the field’s definitions, facts, concepts, models, skills, etc. None of the “Body of Knowledge” lists are complete, but we might use them as a start for brainstorming about what would be useful questions for the exam.

The Open Certification (OC) exam is different from other high-stakes exams because the OC question database serves as a study guide. Questions that might be too hard in a surprise-test (a test with questions you’ve never seen before) might be instructive in a test database that prepares you for an exam derived from the database questions–especially when the test database includes discussion of the questions and answers, not just the barebones questions themselves.

2. Base each item on important content to learn; avoid trivial content.

The heuristic for Open Certification is: Don’t ask the question unless you think a hiring manager would actually care whether this person knew the answer to it.

3. Use novel material to measure understanding and the application of knowledge and skills.

That is, reword the idea you are asking about rather than using the same words as the lecture or assigned readings. This is important advice for a traditional surprise test because people are good matchers:

  • If I show you exactly the same thing that you saw before, you might recognize it as familiar even if you don’t know what it means.
  • If I want to be a nasty trickster, I can put exact-match (but irrelevant) text in a distractor. You’ll be more likely to guess this answer (if you’re not sure of the correct answer) because this one is familiar.

This is important advice for BBST because the student can match the words to the readings (in this open book test) without understanding them. In the open book exam, this doesn’t even require recall.

On the other hand, especially in the open book exams, I like to put exact matches in the stem. The stem is asking a question like “What does this mean?” or “What can you do with this?” If you use textbook phrases to identify the “this,” then you are helping the student figure out where to look for possible answers. In the open book exam, the multiple choice test is a study aid. It is helpful to orient the student to something you want him to think about and read further about.

4. Keep the content of an item independent from content of other items on the test.

Suppose that you define a term in one question and then ask how to apply the concept in the next. The student who doesn’t remember the definition will probably be able to figure it out after reading the next question (the application).

It’s a common mistake to write an exam that builds forward without realizing that the student can read the questions and answer them in any order.

5. Avoid overspecific and overgeneral content.

The concern with questions that are overly specific is that they are usually trivial. Does it really matter what year Boris Beizer wrote his famous Software Testing Techniques? Isn’t it more important to know what techniques he was writing about and why?

There are some simple facts that we might expect all testers to know.

For example, what’s the largest ASCII code in the lower ASCII character set, and what character does it signify?

The boundary cases for ASCII might be core testing knowledge, and thus fair game.
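
For reference, taking “lower ASCII” to mean the 7-bit set (codes 0 through 127), the largest code is 127 (hex 7F), the DEL (delete) control character. The short sketch below confirms the boundary by asking Python’s ASCII codec which byte values it will decode; the helper name `is_lower_ascii` is just an illustration.

```python
def is_lower_ascii(code):
    """Return True if a byte value (0-255) decodes as 7-bit ASCII."""
    try:
        bytes([code]).decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


largest = max(code for code in range(256) if is_lower_ascii(code))
print(largest, hex(largest), repr(chr(largest)))  # 127 0x7f '\x7f'
```

The neighboring values 127 and 128 are the kind of boundary pair a tester would want to try against any field that claims to accept ASCII input.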

However, in most cases, facts are easy to look up in books or with an electronic search. Before asking for a memorized fact, ask why you would care whether the tester had memorized that fact or not.

The concern with questions that are overly general is that they are also usually trivial–or wrong–or both.

6. Avoid opinion-based items.

This is obvious, right? A question is unfair if it asks for an answer that some experts would consider correct and rejects an answer that other experts would consider correct.

But we have this problem in testing.

There are several mutually exclusive definitions of “test case.” There are strong professional differences about the value of a test script or the utility of the V-model or even whether the V-model was implicit in the waterfall model (read the early papers) or a more recent innovation.

Most of the interesting definitions in our field convey opinions, and the Standards that assert the supposedly-correct definitions get that way by ignoring the controversies.

What tactics can we use to deal with this?

a. The qualified opinion.

For example, consider this question:

“The definition of exploratory testing is…”

and this answer:

“a style of software testing that emphasizes the personal freedom and responsibility of the individual tester to continually optimize the value of her work by treating test-related learning, test design, test execution, and test result interpretation as mutually supportive activities that run in parallel throughout the project.”

Is the answer correct or not?

Some people think that exploratory testing is bound tightly to test execution; they would reject the definition.

On the other hand, if we changed the question to,

“According to Cem Kaner, the definition of exploratory testing is…”

that long definition would be the right answer.

Qualification is easy in the BBST course because you can use the qualifier “According to the lecture.” This is what the student is studying right now and the exam is open book, so the student can check the fact easily.

Qualification is more problematic for closed-book exams like the certification exam. In this general case, can we fairly expect students to know who prefers which definition?

The problem is that qualified opinions contain an often-trivial fact. Should we really expect students or certification-examinees to remember definitions in terms of who said what? Most of the time, I don’t think so.

b. Drawing implications

For example, consider asking a question in one of these ways:

  • If A means X, then if you do A, you should expect the following results.
  • Imagine two definitions of A: X and Y. Which bugs would you be more likely to expose if you followed X in your testing and which if you followed Y?
  • Which definition of X is most consistent with theory Y?

7. Avoid trick items.

Haladyna (2004, p. 104) reports work by Roberts that identified several types of (intentional or unintentional) tricks in questions:

    1. The item writer’s intention appeared to deceive, confuse, or mislead test takers.
    2. Trivial content was represented (which violates one of our item-writing guidelines).
    3. The discrimination among options was too fine.
    4. Items had window dressing that was irrelevant to the problem.
    5. Multiple correct answers were possible.
    6. Principles were presented in ways that were not learned, thus deceiving students.
    7. Items were so highly ambiguous that even the best students had no idea about the right answer.

Some other tricks that undermine accurate assessment:

    1. Put text in a distractor that is irrelevant to the question but exactly matches something from the assigned readings or the lecture.
    2. Use complex logic (such as not (A and B) or a double negative) — unless the learning being tested involves complex logic.
    3. Accurately qualify a widely discredited view: According to famous-person, the definition of X is Y, where Y is a definition no one accepts any more, but famous-person did in fact publish it.
    4. In the set of items for a question, leave grammatical errors in all but the second-best choice. (Many people will guess that the grammatically-correct answer is the one intended to be graded as correct.)

Items that require careful reading are not necessarily trick items. This varies from field to field. For example, my experience with exams for lawyers and law students is that they often require very precise reading. Testers are supposed to be able to do very fine-grained specification analysis.

Consider Example D:

D. The key difference between black box testing and behavioral testing is that:

The options include several differences that students find plausible. Every time I give this question, some students choose a combination answer (such as (a) and (b)). This is a mistake, because the question calls for “The key difference,” and that cannot be a collection of two or more differences.

Consider Example E:

E. What is the significance of the difference between black box and glass box tests?

A very common mistake is to choose this answer:

Glass box tests focus on the internals of the program whereas black box tests focus on the externally visible behavior.

The answer is an accurate description of the difference, but it says nothing about the significance of the difference. Why would someone care about the difference? What is the consequence of the difference?

Over time, students learn to read questions like this more carefully. My underlying assumption is that they are also learning or applying, in the course of this, skills they need to read technical documents more carefully. Those are important skills for both software testing and legal analysis and so they are relevant to the courses that are motivating this tutorial. However, for other courses, questions like these might be less suitable.

On a high-stakes exam, with students who had not had a lot of exam-preparation training, I would not ask these questions because I would not expect students to be prepared for them. On the high-stakes exam, the ambiguity of a wrong answer (might not know the content vs. might not have parsed the question carefully) could lead to the wrong conclusion about the student’s understanding of the material.

In contrast, in an instructional context in which we are trying to teach students to parse what they read with care, there is value in subjecting students to low-risk reminders to read with care.

STYLE AND FORMAT CONCERNS

8. Format items vertically instead of horizontally.

If the options are brief, you could format them as a list of items, one beside the next. However, these lists are often harder to read and it is much harder to keep formatting consistent across a series of questions.

9. Edit items for clarity.

I improve the clarity of my test items in several ways:

  • I ask colleagues to review the items.
  • I coteach with other instructors or with teaching assistants. They take the test and discuss the items with me.
  • I encourage students to comment on test items. I use course management systems, so it is easy to set up a question-discussion forum for students to query, challenge or complain about test items.

In my experience, it is remarkable how many times an item can go through review (and improvement) and still be confusing.

10. Edit items for correct grammar, punctuation, capitalization and spelling.

It is common for instructors to write the stem and the correct choice together when they first write the question. The instructor words the distractors later, often less carefully and in some way that is inconsistent with the correct choice. These differences become undesirable clues about the right and wrong choices.

11. Simplify vocabulary so that reading comprehension does not interfere with testing the content intended.

There’s not much point asking a question that the examinee doesn’t understand. If the examinee doesn’t understand the technical terms (the words or concepts being tested), that’s one thing. But if the examinee doesn’t understand the other terms, the question simply won’t reach the examinee’s knowledge.

12. Minimize reading time. Avoid excessive verbiage.

Students whose first language is not English often have trouble with long questions.

13. Proofread each item.

Despite editorial care, remarkably many simple mistakes survive review or are introduced by mechanical error (e.g. cutting and pasting from a master list to the test itself).

WRITING THE STEM

14. Make the directions as clear as possible.

Consider the following confusingly-written question:

A program will accept a string of letters and digits into a password field. After it accepts the string, it asks for a comparison string, and on accepting a new input from the customer, it compares the first string against the second and rejects the password entry if the strings do not match.

  1. There are 218340105584896 possible tests of 8-character passwords.
  2. This method of password verification is subject to the risk of input-buffer overflow from an excessively long password entry
  3. This specification is seriously ambiguous because it doesn’t tell us whether the program accepts or rejects/filters non-alphanumeric characters into the second password entry

Let us pretend that each of these answers could be correct. Which is correct for this question? Is the stem calling for an analysis of the number of possible tests, the risks of the method, the quality of the specification, or something else?

The stem should make clear whether the question is looking for the best single answer or potentially more than one, and whether the question is asking for facts, opinion, examples, reasoning, a calculation, or something else.

The reader should never have to read the set of possible answers to understand what the question is asking.
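
As a side check on the password example, the figure in its first option is the number of strings of exactly eight characters drawn from the 62 letters and digits. That reading assumes case-sensitive letters and counts only length-8 entries; the example itself does not say so. A minimal sketch, with the hypothetical helper `count_strings`:

```python
import string

# 26 lowercase + 26 uppercase + 10 digits = 62 symbols
ALPHABET = string.ascii_letters + string.digits


def count_strings(alphabet_size, length):
    """Number of distinct strings of exactly `length` symbols."""
    return alphabet_size ** length


print(len(ALPHABET))                    # 62
print(count_strings(len(ALPHABET), 8))  # 218340105584896
```

Changing either assumption changes the count (for instance, case-insensitive letters give 36 ** 8), which is one more reason the stem has to say exactly what is being asked.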

15. Make the stem as brief as possible.

This is part of the same recommendation as Heuristic #12 above. If the entire question should be as short as possible (#12), the stem should be as short as possible.

However, “as short as possible” does not necessarily mean “short.”

Here are some examples:

  • The stem describes some aspect of the program in enough detail that it is possible to compute the number of possible software test cases. The choices include the correct answer and three miscalculations.
  • The stem describes a software development project in enough detail that the reader can see the possibility of doing a variety of tasks and the benefits they might offer to the project, and then asks the reader to prioritize some of the tasks. The choices are of the form, “X is more urgent than Y.”
  • The stem describes a potential error in the code, the types of visible symptoms that this error could cause, and then calls for selection of the best test technique for exposing this type of bug.
  • The stem quotes part of a product specification and then asks the reader to identify an ambiguity or to identify the most serious impact on test design an ambiguity like this might cause.
  • The stem describes a test, a failure exposed by the test, a stakeholder (who has certain concerns) who receives failure reports and is involved in decisions about the budget for the testing effort, and asks which description of the failure would be most likely to be perceived as significant by that stakeholder. An even more interesting question (faced frequently by testers in the real world) is which description would be perceived as significant (credible, worth reading and worth fixing) by Stakeholder 1 and which other description would be more persuasive for Stakeholder 2. (Someone concerned with next month’s sales might assess risk very differently from someone concerned with engineering / maintenance cost of a product line over a 5-year period. Both concerns are valid, but a good tester might raise different consequences of the same bug for the marketer than for the maintenance manager).

Another trend for writing test questions that address higher-level learning is to write a very long and detailed stem followed by several multiple choice questions based on the same scenario.

Long questions like these are fair game (normal cases) in exams for lawyers, such as the Multistate Bar Exam. They are looked on with less favor in disciplines that don’t demand the same level of skill in quickly reading/understanding complex blocks of text. Therefore, for many engineering exams (for example), questions like these are probably less popular. Common concerns about them:

  • They discriminate against people whose first language is not English and who are therefore slower readers of complex English text, or more generally against anyone who is a slow reader, because the exam is time-pressed.
  • They discriminate against people who understand the underlying material and who can reach an application of that material to real-life-complexity circumstances if they can work with a genuine situation or a realistic model (something they can appreciate in a hands-on way) but who are not so good at working from hypotheticals that abstract out all information that the examiner considers inessential.
  • They can cause a cascading failure. If the exam includes 10 questions based on one hypothetical and the examinee misunderstands that one hypothetical, she might blow all 10 questions.
  • They can demoralize an examinee who lacks confidence/skill with this type of question, resulting in a bad score because the examinee stops trying to do well on the test.

However, in a low-stakes exam without time limits, those concerns are less important. The exam becomes practice for this type of analysis, rather than punishment for not being good at it.

In software testing, we are constantly trying to simplify a complex product into testable lines of attack. We ignore most aspects of the product and design tests for a few aspects, considered on their own or in combination with each other. We build explicit or implicit mental models of the product under test, and work from those to the tests, and from the tests back to the models (to help us decide what the results should be). Therefore, drawing out the implications of a complex system is a survival skill for testers and questions of this style are entirely fair game–in a low stakes exam, designed to help the student learn, rather than a high-stakes exam designed to create consequences based on an estimate of what the student knows.

16. Place the main idea of the item in the stem, not in the choices.

Some instructors adopt an intentional style in which the stem is extremely short and the question is largely defined in the choices.

The confusingly-written question in Heuristic #14 was an example of a case in which the reader can’t tell what the question is asking until he reads the choices. In #14, there were two problems:

  • the stem didn’t state what question it was asking
  • the choices themselves were fundamentally different, asking about different dimensions of the situation described in the stem rather than exploring one dimension with a correct answer and distracting mistakes. The reader had to guess / decide which dimension was of interest as well as deciding which answer might be correct.

Suppose we fix the second problem but still have a stem so short that you don’t know what the question is asking for until you read the options. That’s the issue addressed here (Heuristic #16).

For example, here is a better-written question that doesn’t pass muster under Heuristic #16:

A software oracle:

  1. is defined this way
  2. is defined this other way
  3. is defined this other way

The better question under this heuristic would be:

What is the definition of a software oracle?

  1. this definition
  2. this other definition
  3. this other other definition

As long as the options are strictly parallel (they are alternative answers to the same implied question), I don’t think this is a serious problem.

17. Avoid irrelevant information (window dressing).

Imagine a question that includes several types of information in its description of some aspect of a computer program:

  • details about how the program was written
  • details about how the program will be used
  • details about the stakeholders who are funding or authorizing the project
  • details about ways in which products like this have failed before

All of these details might be relevant to the question, but probably most of them are not relevant to any particular question. For example, calculating the theoretically possible number of tests of part of the program doesn’t require any knowledge of the stakeholders.

Information:

  • is irrelevant if you don’t need it to determine which option is the correct answer
  • unless the reader’s ability to wade through irrelevant information of this type in order to get to the right underlying formula (or generally, the right approach to the problem) is part of what you are testing

18. Avoid negative words in the stem.

Here are some examples of stems with negative structure:

  • Which of the following is NOT a common definition of software testing?
  • Do NOT assign a priority to a bug report EXCEPT under what condition(s)?
  • You should generally compute code coverage statistics UNLESS:

For many people, these are harder than questions that ask for the same information in a positively-phrased way.

There is some evidence that there are cross-cultural variations. That is, these questions are harder for some people than others because (probably) of their original language training in childhood. Therefore, a bad result on this question might have more to do with the person’s heritage than with their knowledge or skill in software testing.

However, the ability to parse complex logical expressions is an important skill for a tester. Programmers make lots of bugs when they write code to implement things like:

NOT (A OR B) AND C

So testers have to be able to design tests that anticipate the bug and check whether the programmer made it.
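
One way to design such tests is to enumerate the truth table and look for the inputs on which the intended expression and a plausible mistake disagree. The sketch below is only an illustration: the function names and the particular mistake (dropping the parentheses, so that NOT applies only to A) are assumptions, not taken from any real program.

```python
from itertools import product

def intended(a: bool, b: bool, c: bool) -> bool:
    # The specified rule: NOT (A OR B) AND C
    return (not (a or b)) and c

def buggy(a: bool, b: bool, c: bool) -> bool:
    # Hypothetical mistake: the parentheses were dropped.
    # Python parses this as (not a) or (b and c).
    return not a or b and c

def distinguishing_inputs():
    """Return every (A, B, C) combination on which the two versions disagree."""
    return [
        (a, b, c)
        for a, b, c in product([False, True], repeat=3)
        if intended(a, b, c) != buggy(a, b, c)
    ]

for a, b, c in distinguishing_inputs():
    print(f"A={a!s:5} B={b!s:5} C={c!s:5} "
          f"intended={intended(a, b, c)} buggy={buggy(a, b, c)}")
```

For this assumed mistake, the one input on which the intended rule is true (A false, B false, C true) does not expose the bug, because both versions return true there. A tester who tries only that case would miss it.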

It is not unfair to ask a tester to handle some complex negation, if your intent is to test whether the tester can work with complex logical expressions. But if you think you are testing something else, and your question demands careful logic processing, you won’t know from a bad answer whether the problem was the content you thought you were testing or the logic that you didn’t consider.

Another problem is that many people read negative sentences as positive. Their eyes glaze over when they see the NOT and they answer the question as if it were positive (Which of the following IS a common definition of software testing?). Unless you are testing for glazy eyes, you should make the negation as visible as possible. I use ITALICIZED ALL-CAPS BOLDFACE in the examples above.

WRITING THE CHOICES (THE OPTIONS)

19. Develop as many effective options as you can, but two or three may be sufficient.

Imagine an exam with 100 questions. All of them have two options. Someone who is randomly guessing should get 50% correct.

Now imagine an exam with 100 questions that all have four options. Under random guessing, the examinee should get 25%.

The issue of effectiveness is important because an answer that is not credible (not effective) won’t gain any guesses. For example, imagine that you saw this question on a quiz in a software testing course:

Green-box testing is:

  1. common at box manufacturers when they start preparing for the Energy Star rating
  2. a rarely-taught style of software testing
  3. a nickname used by automobile manufacturers for tests of hybrid cars
  4. the name of Glen Myers’ favorite book

I suspect that most students would pick choice 2 because 1 and 3 are irrelevant to the course and 4 is ridiculous (if it was a proper name, for example, “Green-box testing” would be capitalized.) So even though there appear to be 4 choices, there is really only 1 effective one.
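
The arithmetic behind these guessing rates can be sketched as follows. This is a simplified model, not a scoring method from the literature: it assumes the guesser chooses uniformly among the options they find credible and that the correct answer is always one of those. The function name is made up for the illustration.

```python
def expected_guess_score(num_questions: int, effective_options: int) -> float:
    """Expected number of correct answers from pure guessing.

    Assumes a uniform guess among the credible ("effective") options,
    with the correct answer always among them.
    """
    if effective_options < 1:
        raise ValueError("a question needs at least one effective option")
    return num_questions / effective_options

print(expected_guess_score(100, 2))  # two effective options per question
print(expected_guess_score(100, 4))  # four effective options per question
print(expected_guess_score(100, 1))  # four listed options, only one credible
```

Under this model, a 100-question exam built like the green-box example would give a pure guesser a perfect score, even though every question lists four choices.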

The number of choices is important, as is the correction-for-guessing penalty, if you are using multiple choice test results to assign a grade or assess the student’s knowledge in a way that carries consequences for the student.
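
The text does not say which correction is meant. A minimal sketch, assuming the conventional formula-scoring rule (number right minus number wrong divided by one less than the number of options), shows how the penalty depends on the number of choices. The function name is hypothetical.

```python
def corrected_score(num_right: int, num_wrong: int, options_per_question: int) -> float:
    """Conventional correction for guessing: R - W / (k - 1).

    Omitted questions count as neither right nor wrong.
    """
    if options_per_question < 2:
        raise ValueError("the correction needs at least two options per question")
    return num_right - num_wrong / (options_per_question - 1)

# A pure guesser on 100 questions is expected to net zero either way:
print(corrected_score(25, 75, 4))  # four options: 25 right, 75 wrong
print(corrected_score(50, 50, 2))  # two options: 50 right, 50 wrong
```

The correction only behaves this way when every listed option is effective. If some distractors attract no guesses, the guesser’s expected corrected score rises above zero.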

The number of choices — the final score — is much less important if the quiz is for learning support rather than for assessment.

The Open Certification exam is for assessment and has a final score, but it is different from other exams in that examinees can review the questions and consider the answers in advance. Statistical theories of scoring just don’t apply well under those conditions.

20. Vary the location of the right answer according to the number of options. Assign the position of the right answer randomly.

There’s an old rule of thumb–if you don’t know the answer, choose the second one in the list. Some inexperienced exam-writers tend to put the correct answer in the same location more often than if they varied location randomly. Experienced exam-writers use a randomization method to eliminate this bias.
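
A randomization method can be as simple as shuffling the options with a random number generator. The sketch below is one possible approach, with made-up names and option texts. It assumes the option texts are distinct and that the options have no natural order.

```python
import random
from collections import Counter

def place_correct_answer(correct, distractors, rng):
    """Shuffle the options and return (options, index of the correct one).

    Assumes the option texts are distinct.
    """
    options = [correct, *distractors]
    rng.shuffle(options)
    return options, options.index(correct)

# Fixed seed only so that the demonstration is repeatable.
rng = random.Random(2007)
positions = Counter(
    place_correct_answer("right", ["wrong 1", "wrong 2", "wrong 3"], rng)[1]
    for _ in range(4000)
)
print(sorted(positions.items()))  # each position should hold roughly a quarter
```

Shuffling does not fit questions whose options must stay in a fixed order, such as ordered numeric choices or combination options like “(a) and (b)”. For those, randomize which base option is correct when you write the question instead.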

21. Place options in logical or numerical order.

The example that Haladyna gives is numeric. If you’re going to ask the examinee to choose the right number from a list of choices, then present them in order (like $5, $10, $20, $175) rather than randomly (like $20, $5, $175, $10).

In general, the idea underlying this heuristic is that the reader is less likely to make an accidental error (one unrelated to their knowledge of the subject under test) if the choices are ordered and formatted in the way that makes them as easy as possible to read quickly and understand correctly.

22. Keep options independent; choices should not be overlapping.

Assuming standard productivity metrics, how long should it take to create and document 100 boundary tests of simple input fields?

  1. 1 hour or less
  2. 5 hours or less
  3. between 3 and 7 hours
  4. more than 6 hours

These choices overlap. If you think the correct answer is 4 hours, which one do you pick as the correct answer?
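
A quick way to catch this kind of overlap is to write each option as a rule and try a few probe values against all of them. The sketch below encodes the four options above. Treating “between 3 and 7 hours” as inclusive is an assumption, as are the names and the probe values.

```python
# Each option is a rule over the number of hours.
OPTIONS = {
    "a": lambda hours: hours <= 1,        # 1 hour or less
    "b": lambda hours: hours <= 5,        # 5 hours or less
    "c": lambda hours: 3 <= hours <= 7,   # between 3 and 7 hours (assumed inclusive)
    "d": lambda hours: hours > 6,         # more than 6 hours
}

def matching_options(hours, options=OPTIONS):
    """Return the labels of every option that the value satisfies."""
    return [label for label, accepts in options.items() if accepts(hours)]

def ambiguous_probes(probes, options=OPTIONS):
    """Map each probe that does not fit exactly one option to the options it fits."""
    result = {}
    for hours in probes:
        labels = matching_options(hours, options)
        if len(labels) != 1:
            result[hours] = labels
    return result

print(matching_options(4))
print(ambiguous_probes([0.5, 2, 4, 5.5, 6.5, 8]))
```

An examinee who believes the answer is 4 hours fits both the second and third options, which is exactly the problem described above.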

Here is a style of question that I sometimes use that might look overlapping at first glance, but is not:

What is the best course of action in context C?

  1. Do X because of RY (the reason you should do Y).
  2. Do X because of RX (the reason you should do X, but a reason that the examinee is expected to know is impossible in context C)
  3. Do Y because of RY (the correct answer)
  4. Do Y because of RX

Two options tell you to do Y (the right thing to do), but for different reasons. One reason is appropriate, the other is not. The test is checking not just whether the examinee can decide what to do but whether she can correctly identify why to do it. This can be a hard question but if you expect a student to know why to do something, requiring them to pick the right reason as well as the right result is entirely fair.

23. Keep the options homogeneous in content and grammatical structure.

Inexperienced exam writers often accidentally introduce variation between the correct answer and the others. For example, the correct answer:

  • might be properly punctuated
  • might start with a capital letter (or not start with one) unlike the others
  • might end with a period or semi-colon (unlike the others)
  • might be present tense (the others in past tense)
  • might be active voice (the others in passive voice), etc.

The most common reason for this is that some exam authors write a long list of stems and correct answers, then fill the rest of the questions in later.

The nasty, sneaky tricky exam writer knows that test-wise students look for this type of variation and so introduces it deliberately:

Which is the right answer?

  1. this is the right answer
  2. This is the better-formatted second-best answer.
  3. this is a wrong answer
  4. this is another wrong answer

The test-savvy guesser will be drawn to answer 2 (bwaa-haaa-haa!)

Tricks are one way to keep down the scores of skilled guessers, but when students realize that you’re hitting them with trick questions, you can lose your credibility with them.

24. Keep the length of options about the same.

Which is the right answer?

  1. this is the wrong answer
  2. This is a really well-qualified and precisely-stated answer that is obviously more carefully considered than the others, so which one do you think is likely to be the right answer?
  3. this is a wrong answer
  4. this is another wrong answer
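
A rough automated check for this problem is to flag any option that is much longer than the typical option. The sketch below uses the example above. The threshold of twice the median length is an arbitrary assumption, and the function name is made up.

```python
from statistics import median

def length_outliers(options, ratio=2.0):
    """Return the indexes of options longer than `ratio` times the median length."""
    lengths = [len(text) for text in options]
    typical = median(lengths)
    return [i for i, n in enumerate(lengths) if n > ratio * typical]

OPTIONS = [
    "this is the wrong answer",
    "This is a really well-qualified and precisely-stated answer that is "
    "obviously more carefully considered than the others, so which one do "
    "you think is likely to be the right answer?",
    "this is a wrong answer",
    "this is another wrong answer",
]

print(length_outliers(OPTIONS))  # indexes of suspiciously long options
```

A check like this only catches length. It says nothing about whether the long option is also better qualified or more carefully worded than the others.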

25. “None of the above” should be used carefully.

As Haladyna points out, there is a fair bit of controversy over this heuristic:

  • If you use it, make sure that you make it the correct answer sometimes and the incorrect answer sometimes
  • Use it when you are trying to make the student actually solve a problem and assess the reasonability of the possible solutions

26. Avoid using “all of the above.”

The main argument against “all of the above” is that if there is an obviously incorrect option, then “all of the above” is obviously incorrect too. Thus, test-wise examinees can reduce the number of plausible options easily. If you are trying to statistically model the difficulty of the exam, or create correction factors (a “correction” is a penalty for guessing the wrong answer), then including an option that is obviously easier than the others makes the modeling messier.

In our context, we aren’t “correcting” for guessing or estimating the difficulty of the exam:

  • In the BBST (open book) exam, the goal is to get the student to read the material carefully and think about it. Difficulty of the question is more a function of difficulty of the source material than of the question.
  • In the Open Certification exam, every question appears on a public server, along with a justification of the intended-correct answer and public commentary. Any examinee can review these questions and discussions. Some will, some won’t, some will remember what they read and some won’t, some will understand what they read and some won’t–how do you model the difficulty of questions this way? Whatever the models might be, the fact that the “all of the above” option is relatively easy for some students who have to guess is probably a minor factor.

Another argument is more general. Several authors, including Haladyna, Downing, & Rodriguez (2002), recommend against the complex question that allows more than one correct answer. This makes the question more difficult and more confusing for some students.

Even though some authors recommend against it, our question construction adopts a complex structure that allows selection of combinations (such as (a) and (b) as well as all of the above), because other educational researchers consider this structure a useful vehicle for presenting difficult questions in a fair way. See for example Wongwiwatthananukit, Popovich & Bennett (2000) and their references.

Note that in the BBST / Open Certification structure, the fact that there is a combination choice or an all of the above choice is not informative because most questions have these.
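
The partial-credit standard stated at the start of this tutorial for these combination questions can be sketched in code. The sketch treats each choice as the set of base options it names. The rule “25% for a choice that is exactly one base option short of the correct set, and nothing for any choice that includes a wrong option” is an assumed generalization that fits the stated cases; the function name is hypothetical.

```python
def partial_credit(selected: set, correct: set) -> float:
    """Score one answer, where each choice is the set of base options it names.

    Assumed rule: full credit for an exact match; 25% for a non-empty choice
    that contains no wrong option and is exactly one option short; else zero.
    """
    if selected == correct:
        return 1.0
    one_short = len(correct) - len(selected) == 1
    if selected and selected < correct and one_short:  # "<" is proper subset
        return 0.25
    return 0.0

print(partial_credit({"a"}, {"a", "b"}))            # only (a) when (a) and (b) is correct
print(partial_credit({"a", "b"}, {"a", "b", "c"}))  # (d) when all of the above is correct
print(partial_credit({"a"}, {"a", "b", "c"}))       # only (a) when all of the above is correct
print(partial_credit({"a", "c"}, {"a", "b"}))       # includes a wrong option
```

Whatever rule is adopted, writing it down this precisely makes it easier to apply consistently and to explain to examinees.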

There is a particular difficulty with this structure, however. Consider this question:

Choose the answer:

  1. This is the best choice
  2. This is a bad choice
  3. This is a reasonable answer, but (a) is far better–or this is really a subset of (a), weak on its own but it would be the only correct one if (a) was not present.
  4. (a) and (b)
  5. (a) and (c)
  6. (b) and (c)
  7. (a) and (b) and (c)

In this case, the student will have an unfairly hard time choosing between (a) and (e). We have created questions like this accidentally, but when we recognize this problem, we fix it in one of these ways:

Alternative 1. Choose the answer:

    1. This is the best choice
    2. This is a bad choice
    3. This is a reasonable answer, but (a) is far better–or this is really a subset of (a), weak on its own but it would be the only correct one if (a) was not present.
    4. This is a bad choice
    5. (a) and (b)
    6. (b) and (c)
    7. (a) and (b) and (c)

    In this case, we make sure that (a) and (c) is not available for selection.

Alternative 2. Choose the answer:

    1. This is the best choice
    2. This is a bad choice
    3. This is a reasonable answer, but (a) is far better–or this is really a subset of (a), weak on its own but it would be the only correct one if (a) was not present.
    4. This is a bad choice

    In this case, no combinations are available for selection.

27. Avoid negative words such as not or except.

This is the same advice, for the options, as we provided in Heuristic #18 for the stem, for the same reasons.

28. Avoid options that give clues to the right answer.

Some of the mistakes mentioned by Haladyna, Downing, & Rodriguez (2002) are:

  • Broad assertions that are probably incorrect, such as always, never, must, and absolutely.
  • Choices that sound like words in the stem, or words that sound like the correct answer
  • Grammatical inconsistencies, length inconsistencies, formatting inconsistencies, extra qualifiers or other obvious inconsistencies that point to the correct choice
  • Pairs or triplet options that point to the correct choice. For example, if every combination option includes (a) (such as “(a) and (b)”, “(a) and (c)”, and “all of the above”), then it is pretty obvious that (a) is probably correct and any answer that excludes (a) (such as (b)) is probably wrong.
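
The last of these clues is easy to check mechanically. The sketch below, with a made-up function name, reports any base option that appears in every combination choice of a question.

```python
def options_in_every_combination(combinations):
    """Return the base options that appear in every combination choice."""
    sets = [set(combo) for combo in combinations]
    if not sets:
        return set()
    return set.intersection(*sets)

# Every combination includes (a), which gives the answer away.
leaky = [("a", "b"), ("a", "c"), ("a", "b", "c")]
print(options_in_every_combination(leaky))

# The full set of pairs plus "all of the above" singles out no option.
balanced = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b", "c")]
print(options_in_every_combination(balanced))
```

An empty result means that the combination choices do not favor any one base option.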

29. Make all distractors plausible.

This is important for two reasons:

  • If you are trying to do statistical modeling of the difficulty of the exam (“There are 4 choices in this question, therefore there is only a 25% chance of a correct answer from guessing”) then implausible distractors invalidate the model because few people will make this guess. However, in our tests, we aren’t doing this modeling so this doesn’t matter.
  • An implausible choice is a waste of space and time. If no one will make this choice, it is not really a choice. It is just extra text to read.

One reason that an implausible distractor is sometimes valuable is that sometimes students do pick obviously unreasonable distractors. In my experience, this happens when the student is:

  • ill, and not able to concentrate
  • falling asleep, and not able to concentrate
  • on drugs or drunk, and not able to concentrate or temporarily afflicted with a very strange sense of humor
  • copying answers (in a typical classroom test, looking at someone else’s exam a few feet away) and making a copying mistake.

I rarely design test questions with the intent of including a blatantly implausible option, but I am an inept enough test-writer that a few slip by anyway. These aren’t very interesting in the BBST course, but I have found them very useful in traditional quizzes in the traditionally-taught university course.

30. Use typical errors of students when you write distractors.

Suppose that you gave a fill-in-the-blank question to students. In this case, for example, you might ask the student to tell you the definition rather than giving students a list of definitions to choose from. If you gathered a large enough sample of fill-in-the-blank answers, you would know what the most common mistakes are. Then, when you create the multiple choice question, you can include these as distractors. The students who don’t know the right answer are likely to fall into one of the frequently-used wrong answers.

I rarely have the opportunity to build questions this way, but the principle carries over. When I write a question, I ask “If someone was going to make a mistake, what mistake would they make?”

31. Use humor if it is compatible with the teacher; avoid humor in a high-stakes test.

Robert F. McMorris, Roger A. Boothroyd, & Debra J. Pietrangelo (1997) and Powers (2005) advocate for carefully controlled use of humor in tests and quizzes. I think this is reasonable in face-to-face instruction, once the students have come to know the instructor (or in a low-stakes test while students are getting to know the instructor). However, in a test that involves students from several cultures, who have varying degrees of experience with the English language, I think humor in a quiz can create more confusion and irritation than it is worth.

References

These notes summarize lessons that came out of the last Workshop on Open Certification (WOC 2007) and from private discussions related to BBST.

There’s a lot of excellent advice on writing multiple-choice test questions. Here are a few sources that I’ve found particularly helpful:

  1. Lorin Anderson, David Krathwohl, & Benjamin Bloom, A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives, Complete Edition, Longman Publishing, 2000.
  2. National Conference of Bar Examiners, Multistate Bar Examination Study Aids and Information Guides.
  3. Steven J. Burton, Richard R. Sudweeks, Paul F. Merrill, Bud Wood, How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty, Brigham Young University Testing Services, 1991.
  4. Thomas M. Haladyna, Writing Test Items to Evaluate Higher Order Thinking, Allyn & Bacon, 1997.
  5. Thomas M. Haladyna, Developing and Validating Multiple-Choice Test Items, 3rd Edition, Lawrence Erlbaum, 2004.
  6. Thomas M. Haladyna, Steven M. Downing, Michael C. Rodriguez, A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment, Applied Measurement in Education, 15(3), 309–334, 2002.
  7. Robert F. McMorris, Roger A. Boothroyd, & Debra J. Pietrangelo, Humor in Educational Testing: A Review and Discussion, Applied Measurement in Education, 10(3), 269–297, 1997.
  8. Ted Powers, Engaging Students with Humor, Association for Psychological Science Observer, 18(12), December 2005.
  9. The Royal College of Physicians and Surgeons of Canada, Developing Multiple Choice Questions for the RCPSC Certification Examinations.
  10. Supakit Wongwiwatthananukit, Nicholas G. Popovich, & Deborah E. Bennett, Assessing pharmacy student knowledge on multiple-choice examinations using partial-credit scoring of combined-response multiple-choice items, American Journal of Pharmaceutical Education, Spring, 2000.
  11. Bibliography and links on Multiple Choice Questions at http://ahe.cqu.edu.au/MCQ.htm

One Response to “Writing Multiple Choice Test Questions”

  1. A well thought out post. I don’t have any experience in creating exams, but I do have experience in taking some. There were some experiences that I found distressing when taking multiple choice exams at University and would be interested in your thoughts. (I did read through most of the post and couldn’t see them addressed, if they are I apologise).

    - Too many options. One exam I took had an average of 12 options per question, with some questions having up to 17 options. Most were a mix, so a + b but not d, e or f and various combinations thereof. Do you have a thought about the optimum number of options?

    I typically restrict my questions to 7 options. I use combination questions (a+b) and the literature on that is mixed.– Cem

    - I took an exam where you got 2 points for a right answer, -1 for a wrong answer and -1 for no answer. When the lecturer was asked about the points system they said they wanted to discourage guessing. I don’t believe that -1 for a wrong answer and -1 for no answer actually achieved that goal. What are your thoughts on negative marks for wrong answers?

    I don’t use them. Guessing is a big problem in high-stakes tests. I don’t use multiple-choice questions for high-stakes testing and so I don’t face the problem.– Cem

    - Another exam (happened to be the same as the first I mentioned) gave partial points for getting one of the correct options. For example, if the question had options a, b, c and d, and the rest were combinations such as a + c, a + c + d, etc., then if a + b were the correct answer, someone who selected a + c would get partial marks because “a” was one of the correct answers. I would be interested in your thoughts on this technique of awarding partial marks.

    We lay out some partial-points standards in the main post, above. (I say “we” because this has been subject to a lot of discussion in the Open Certification process.) I don’t give partial points for combinations that include any wrong answer, but do give partial points for answers that get some but not all of the elements of a combination. Other reasonable alternatives exist. Settling on one consistent rule that is clearly stated seems as important as adopting the one best rule.– Cem