Archive for October, 2007

Writing Multiple Choice Test Questions

Wednesday, October 24th, 2007

SUMMARY

This is a tutorial on creating multiple choice questions, framed by Haladyna’s heuristics for test design and Anderson & Krathwohl’s update to Bloom’s taxonomy. My interest in computer-gradable test questions is to support teaching and learning rather than high-stakes examination. Some of the design heuristics are probably different for this case. For example, which is the more desirable attribute for a test question:

  a. defensibility (you can defend its fairness and appropriateness to a critic) or
  b. potential to help a student gain insight?

In high-stakes exams, (a) [defensibility] is clearly more important, but as a support for learning, I’d rather have (b) [support for insight].

This tutorial’s examples are from software engineering, but from my perspective as someone who has also taught psychology and law, I think the ideas are applicable across many disciplines.

The tutorial’s advice and examples specifically target three projects: the AST BBST course series, the Open Certification examination, and a set of learning units on engineering ethics.


STANDARDS SPECIFIC TO THE BBST AND OPEN CERTIFICATION QUESTIONS

1. Consider a question with the following structure:

Choose the answer:

  a. First option
  b. Second option

The typical way we will present this question is:

Choose the answer:

  a. First option
  b. Second option
  c. Both (a) and (b)
  d. Neither (a) nor (b)

  • If the correct answer is (c) then the examinee will receive 25% credit for selecting only (a) or only (b).

2. Consider a question with the following structure:

Choose the answer:

  a. First option
  b. Second option
  c. Third option

The typical way we will present this question is:

Choose the answer:

  a. First option
  b. Second option
  c. Third option
  d. (a) and (b)
  e. (a) and (c)
  f. (b) and (c)
  g. (a) and (b) and (c)

  • If the correct answer is (d), the examinee will receive 25% credit for selecting only (a) or only (b). Similarly for (e) and (f).
  • If the correct answer is (g) (all of the above), the examinee will receive 25% credit for selecting (d) or (e) or (f) but nothing for the other choices.

3. Consider a question with the following structure:

Choose the answer:

  a. First option
  b. Second option
  c. Third option
  d. Fourth option

The typical ways we might present this question are:

Choose the answer:

  a. First option
  b. Second option
  c. Third option
  d. Fourth option

OR

Choose the answer:

  a. First option
  b. Second option
  c. Third option
  d. Fourth option
  e. (a) and (c)
  f. (a) and (b) and (d)
  g. (a) and (b) and (c) and (d)

There will be a maximum of 7 choices.

The three combination choices can be any combination of two, three or four of the first four answers.

  • If the correct answer is a pair, like (e), the examinee will receive 25% credit for selecting only one member of that pair, and nothing for selecting a combination that includes both members of the pair along with an incorrect choice (see the scoring sketch below).
  • If the correct answer is (f) (three of the four), the examinee will receive 25% credit for selecting a correct pair (if (a) and (b) and (d) are all correct, then any two of them get 25%) but nothing for selecting only one of the three or selecting a choice that includes two or three correct but also includes an incorrect choice.
  • If the correct answer is (g) (all correct), the examinee will receive a 25% credit for selecting a correct triple.
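
To make the partial-credit rules above concrete, here is a minimal sketch of how they might be scored programmatically. The function and its set-based representation of answers are illustrative only, not part of any BBST or Open Certification grading software:

```python
def partial_credit(correct, selected):
    """Score one complex-format item under the partial-credit rules described above.

    `correct` and `selected` are sets of the basic option letters; for example,
    {"a", "b"} stands for the combination choice "(a) and (b)".
    Returns 1.0, 0.25, or 0.0.
    """
    correct, selected = set(correct), set(selected)
    if selected == correct:
        return 1.0                      # exact match: full credit
    # 25% credit for a selection that is correct but incomplete by exactly one
    # basic option: only (a) when the key is "(a) and (b)", or a correct pair
    # when the key is a triple such as "all of the above".
    if selected < correct and len(correct) - len(selected) == 1:
        return 0.25
    return 0.0                          # anything that includes an incorrect option earns nothing


assert partial_credit({"a", "b"}, {"a"}) == 0.25             # only one member of a correct pair
assert partial_credit({"a", "b", "c"}, {"b", "c"}) == 0.25   # a correct pair when the key is "all of the above"
assert partial_credit({"a", "b", "c"}, {"a"}) == 0.0         # a single pick against a triple
```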

DEFINITIONS AND EXAMPLES

Definitions

Here are a few terms commonly used when discussing the design of multiple choice questions. See the Reference Examples, below.

  • Test: In this article, the word “test” is ambiguous. Sometimes we mean a software test (an experiment that can expose problems in a computer program) and sometimes an academic test (a question that can expose problems in someone’s knowledge). In these definitions, “test” means “academic test.”

  • Test item: a test item is a single test question. It might be a multiple choice test question or an essay test question (or whatever).
  • Content item: a content item is a single piece of content, such as a fact or a rule, something you can test on.
  • Stem: The opening part of the question is called the stem. For example, “Which is the best definition of the testing strategy in a testing project?” is Reference Example B’s stem.
  • Distractor: An incorrect answer. In Reference Example B, (b) and (c) are distractors.
  • Correct choice: The correct answer for Reference Example B is (a) “The plan for applying resources and selecting techniques to achieve the testing mission.”
  • The Question format: The stem is a complete sentence and asks a question that is answered by the correct choice and the distractors. Reference Example A has this format.
  • The Best Answer format: The stem asks a complete question. Most or all of the distractors and the correct choice are correct to some degree, but one of them is stronger than the others. In Reference Example B, all three answers are plausible but in the BBST course, given the BBST lectures, (a) is the best.
  • The Incomplete Stem format: The stem is an incomplete sentence that the correct choice and distractors complete. Reference Example C has this format.
  • Complex formats: In a complex-format question, the alternatives include simple answers and combinations of these answers. In Reference Example A, the examinee can choose (a) “We can never be certain that the program is bug free,” or (d), which says that both (a) and (b) are true, or (g), which says that all of the simple answers ((a), (b), and (c)) are true.
  • Learning unit: A learning unit typically includes a limited set of content that shares a common theme or purpose, plus learning support materials such as a study guide, test items, an explicit set of learning objectives, a lesson plan, readings, lecture notes or video, etc.
  • High-stakes test: A test is high-stakes if there are significant benefits for passing the test or significant costs of failing it.

The Reference Examples

For each of the following, choose one answer.

A. What are some important consequences of the impossibility of complete testing?

  a. We can never be certain that the program is bug free.
  b. We have no definite stopping point for testing, which makes it easier for some managers to argue for very little testing.
  c. We have no easy answer for what testing tasks should always be required, because every task takes time that could be spent on other high importance tasks.
  d. (a) and (b)
  e. (a) and (c)
  f. (b) and (c)
  g. All of the above

B. Which is the best definition of the testing strategy in a testing project?

  a. The plan for applying resources and selecting techniques to achieve the testing mission.
  b. The plan for applying resources and selecting techniques to assure quality.
  c. The guiding plan for finding bugs.

C. Complete statement coverage means …

  a. That you have tested every statement in the program.
  b. That you have tested every statement and every branch in the program.
  c. That you have tested every IF statement in the program.
  d. That you have tested every combination of values of IF statements in the program.

D. The key difference between black box testing and behavioral testing is that:

  a. The test designer can use knowledge of the program’s internals to develop a black box test, but cannot use that knowledge in the design of a behavioral test because the behavioral test is concerned with behavior, not internals.
  b. The test designer can use knowledge of the program’s internals to develop a behavioral test, but cannot use that knowledge in the design of a black box test because the designer cannot rely on knowledge of the internals of the black box (the program).
  c. The behavioral test is focused on program behavior whereas the black box test is concerned with system capability.
  d. (a) and (b)
  e. (a) and (c)
  f. (b) and (c)
  g. (a) and (b) and (c)

E. What is the significance of the difference between black box and glass box tests?

  a. Black box tests cannot be as powerful as glass box tests because the tester doesn’t know what issues in the code to look for.
  b. Black box tests are typically better suited to measure the software against the expectations of the user, whereas glass box tests measure the program against the expectations of the programmer who wrote it.
  c. Glass box tests focus on the internals of the program whereas black box tests focus on the externally visible behavior.

ITEM-WRITING HEURISTICS

Several papers on the web organize their discussion of multiple choice tests around a researched set of advice from Haladyna, Downing & Rodriguez or the updated list from Haladyna (2004). I’ll do that too, tying their advice back to our needs for software testing.

Content Guidelines

  1. Every item should reflect specific content and a single specific cognitive process, as called for in the test specifications (table of specifications, two-way grid, test blueprint).
  2. Base each item on important content to learn; avoid trivial content.
  3. Use novel material to measure understanding and the application of knowledge and skills.
  4. Keep the content of an item independent from content of other items on the test.
  5. Avoid overspecific and overgeneral content.
  6. Avoid opinion-based items.
  7. Avoid trick items.

Style and Format Concerns

  8. Format items vertically instead of horizontally.
  9. Edit items for clarity.
  10. Edit items for correct grammar, punctuation, capitalization and spelling.
  11. Simplify vocabulary so that reading comprehension does not interfere with testing the content intended.
  12. Minimize reading time. Avoid excessive verbiage.
  13. Proofread each item.

Writing the Stem

  14. Make the directions as clear as possible.
  15. Make the stem as brief as possible.
  16. Place the main idea of the item in the stem, not in the choices.
  17. Avoid irrelevant information (window dressing).
  18. Avoid negative words in the stem.

Writing the Options

  19. Develop as many effective options as you can, but two or three may be sufficient.
  20. Vary the location of the right answer according to the number of options. Assign the position of the right answer randomly.
  21. Place options in logical or numerical order.
  22. Keep options independent; choices should not be overlapping.
  23. Keep the options homogeneous in content and grammatical structure.
  24. Keep the length of options about the same.
  25. “None of the above” should be used sparingly.
  26. Avoid using “all of the above.”
  27. Avoid negative words such as not or except.
  28. Avoid options that give clues to the right answer.
  29. Make all distractors plausible.
  30. Use typical errors of students when you write distractors.
  31. Use humor if it is compatible with the teacher; avoid humor in a high-stakes test.

Now to apply those to our situation.

CONTENT GUIDELINES

1. Every item should reflect specific content and a single specific cognitive process, as called for in the test specifications (table of specifications, two-way grid, test blueprint).

Here are the learning objectives from the AST Foundations course. Each objective is paired with the Anderson & Krathwohl level of knowledge or skill we hope the learner will achieve. For discussions of levels of knowledge, see my blog entries on Bloom’s taxonomy [1] [2] [3]:

Learning Objectives of the AST Foundations Course (with Anderson & Krathwohl level)

  1. Familiar with basic terminology and how it will be used in the BBST courses (Understand)
  2. Aware of honest and rational controversy over definitions of common concepts and terms in the field (Understand)
  3. Understand there are legitimately different missions for a testing effort. Understand the argument that selection of mission depends on contextual factors. Able to evaluate relatively simple situations that exhibit strongly different contexts in terms of their implication for testing strategies. (Understand, simple evaluation)
  4. Understand the concept of oracles well enough to apply multiple oracle heuristics to their own work and explain what they are doing and why (Understand and apply)
  5. Understand that complete testing is impossible. Improve ability to estimate and explain the size of a testing problem. (Understand, rudimentary application)
  6. Familiarize students with the concept of measurement dysfunction (Understand)
  7. Improve students’ ability to adjust their focus from narrow technical problems (such as analysis of a single function or parameter) through broader, context-rich problems (Analyze)
  8. Improve online study skills, such as learning more from video lectures and associated readings (Apply)
  9. Improve online course participation skills, including online discussion and working together online in groups (Apply)
  10. Increase student comfort with formative assessment (assessment done to help students take their own inventory, think and learn rather than to pass or fail the students) (Apply)

For each of these objectives, we could list the items that we want students to learn. For example:

  • list the terms that students should be able to define
  • list the divergent definitions that students should be aware of
  • list the online course participation skills that students should develop or improve.

We could create multiple choice tests for some of these:

  • We could check whether students could recognize a term’s definition.
  • We could check whether students could recognize some aspect of an online study skill.

But there are elements in the list that aren’t easy to assess with a multiple choice test. For example, how can you tell whether someone works well with other students by asking them multiple choice questions? To assess that, you should watch how they work in groups, not read multiple-choice answers.

Now, back to Haladyna’s first guideline:

  • Use an appropriate type of test for each content item. Multiple choice is good for some, but not all.
  • If you use a multiple choice test, each test item (each question) should focus on a single content item. That might be a complex item, such as a rule or a relationship or a model, but it should be something that you and the student would consider to be one thing. A question spread across multiple issues is confusing in ways that have little to do with the content being tested.
  • Design the test item to assess the material at the right level (see the learning objectives and levels listed above). For example, if you are trying to learn whether someone can use a model to evaluate a situation, you should ask a question that requires the examinee to apply the model, not one that just asks whether she can remember the model.

When we work with a self-contained learning unit, such as the individual AST BBST courses and the engineering ethics units, it should be possible to list most of the items that students should learn and the associated cognitive level.

However, for the Open Certification exam, the listing task is much more difficult because it is fair game to ask about any of the field’s definitions, facts, concepts, models, skills, etc. None of the “Body of Knowledge” lists are complete, but we might use them as a start for brainstorming about what would be useful questions for the exam.

The Open Certification (OC) exam is different from other high-stakes exams because the OC question database serves as a study guide. Questions that might be too hard in a surprise-test (a test with questions you’ve never seen before) might be instructive in a test database that prepares you for an exam derived from the database questions–especially when the test database includes discussion of the questions and answers, not just the barebones questions themselves.

2. Base each item on important content to learn; avoid trivial content.

The heuristic for Open Certification is: Don’t ask the question unless you think a hiring manager would actually care whether this person knew the answer to it.

3. Use novel material to measure understanding and the application of knowledge and skills.

That is, reword the idea you are asking about rather than using the same words as the lecture or assigned readings. This is important advice for a traditional surprise test because people are good matchers:

  • If I show you exactly the same thing that you saw before, you might recognize it as familiar even if you don’t know what it means.
  • If I want to be a nasty trickster, I can put exact-match (but irrelevant) text in a distractor. You’ll be more likely to guess this answer (if you’re not sure of the correct answer) because this one is familiar.

This is important advice for BBST because the student can match the words to the readings (in this open book test) without understanding them. In the open book exam, this doesn’t even require recall.

On the other hand, especially in the open book exams, I like to put exact matches in the stem. The stem asks a question like “What does this mean?” or “What can you do with this?” If you use textbook phrases to identify the “this,” then you are helping the student figure out where to look for possible answers. In the open book exam, the multiple choice test is a study aid. It is helpful to orient the student to something you want him to think about and read further about.

4. Keep the content of an item independent from content of other items on the test.

Suppose that you define a term in one question and then ask how to apply the concept in the next. The student who doesn’t remember the definition will probably be able to figure it out after reading the next question (the application).

It’s a common mistake to write an exam that builds forward without realizing that the student can read the questions and answer them in any order.

5. Avoid overspecific and overgeneral content.

The concern with questions that are overly specific is that they are usually trivial. Does it really matter what year Boris Beizer wrote his famous Software Testing Techniques? Isn’t it more important to know what techniques he was writing about and why?

There are some simple facts that we might expect all testers to know.

For example, what’s the largest ASCII code in the lower ASCII character set, and what character does it signify?

The boundary cases for ASCII might be core testing knowledge, and thus fair game.
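
For the record, the highest code in the lower (7-bit) ASCII set is 127, the DEL control character; a quick check:

```python
# The lower (7-bit) ASCII range is 0..127; code 127 is the DEL control character.
print(hex(127), repr(chr(127)), chr(127).isprintable())  # 0x7f '\x7f' False
```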

However, in most cases, facts are easy to look up in books or with an electronic search. Before asking for a memorized fact, ask why you would care whether the tester had memorized that fact or not.

The concern with questions that are overly general is that they are also usually trivial–or wrong–or both.

6. Avoid opinion-based items.

This is obvious, right? A question is unfair if it asks for an answer that some experts would consider correct and rejects an answer that other experts would consider correct.

But we have this problem in testing.

There are several mutually exclusive definitions of “test case.” There are strong professional differences about the value of a test script or the utility of the V-model or even whether the V-model was implicit in the waterfall model (read the early papers) or a more recent innovation.

Most of the interesting definitions in our field convey opinions, and the Standards that assert the supposedly-correct definitions get that way by ignoring the controversies.

What tactics can we use to deal with this?

a. The qualified opinion.

For example, consider this question:

“The definition of exploratory testing is…”

and this answer:

“a style of software testing that emphasizes the personal freedom and responsibility of the individual tester to continually optimize the value of her work by treating test-related learning, test design, test execution, and test result interpretation as mutually supportive activities that run in parallel throughout the project.”

Is the answer correct or not?

Some people think that exploratory testing is bound tightly to test execution; they would reject the definition.

On the other hand, if we changed the question to,

“According to Cem Kaner, the definition of exploratory testing is…”

that long definition would be the right answer.

Qualification is easy in the BBST course because you can use the qualifier “According to the lecture.” This is what the student is studying right now and the exam is open book, so the student can check the fact easily.

Qualification is more problematic for closed-book exams like the certification exam. In this general case, can we fairly expect students to know who prefers which definition?

The problem is that qualified opinions contain an often-trivial fact. Should we really expect students or certification-examinees to remember definitions in terms of who said what? Most of the time, I don’t think so.

b. Drawing implications

For example, consider asking a question in one of these ways:

  • If A means X, then if you do A, you should expect the following results.
  • Imagine two definitions of A: X and Y. Which bugs would you be more likely to expose if you followed X in your testing and which if you followed Y?
  • Which definition of X is most consistent with theory Y?

7. Avoid trick items.

Haladyna (2004, p. 104) reports work by Roberts that identified several types of (intentional or unintentional) tricks in questions:

    1. The item writer’s intention appeared to deceive, confuse, or mislead test takers.
    2. Trivial content was represented (which violates one of our item-writing guidelines).
    3. The discrimination among options was too fine.
    4. Items had window dressing that was irrelevant to the problem.
    5. Multiple correct answers were possible.
    6. Principles were presented in ways that were not learned, thus deceiving students.
    7. Items were so highly ambiguous that even the best students had no idea about the right answer.

Some other tricks that undermine accurate assessment:

    1. Put text in a distractor that is irrelevant to the question but exactly matches something from the assigned readings or the lecture.
    2. Use complex logic (such as not (A and B) or a double negative) — unless the learning being tested involves complex logic.
    3. Accurately qualify a widely discredited view: According to famous-person, the definition of X is Y, where Y is a definition no one accepts any more, but famous-person did in fact publish it.
    4. In the set of items for a question, leave grammatical errors in all but the second-best choice. (Many people will guess that the grammatically-correct answer is the one intended to be graded as correct.)

Items that require careful reading are not necessarily trick items. This varies from field to field. For example, my experience with exams for lawyers and law students is that they often require very precise reading. Testers are supposed to be able to do very fine-grained specification analysis.

Consider Example D:

D. The key difference between black box testing and behavioral testing is that:

The options include several differences that students find plausible. Every time I give this question, some students choose a combination answer (such as (a) and (b)). This is a mistake, because the question calls for “The key difference,” and that cannot be a collection of two or more differences.

Consider Example E:

E. What is the significance of the difference between black box and glass box tests?

A very common mistake is to choose this answer:

Glass box tests focus on the internals of the program whereas black box tests focus on the externally visible behavior.

The answer is an accurate description of the difference, but it says nothing about the significance of the difference. Why would someone care about the difference? What is the consequence of the difference?

Over time, students learn to read questions like this more carefully. My underlying assumption is that they are also learning or applying, in the course of this, skills they need to read technical documents more carefully. Those are important skills for both software testing and legal analysis and so they are relevant to the courses that are motivating this tutorial. However, for other courses, questions like these might be less suitable.

On a high-stakes exam, with students who had not had a lot of exam-preparation training, I would not ask these questions because I would not expect students to be prepared for them. On the high-stakes exam, the ambiguity of a wrong answer (might not know the content vs. might not have parsed the question carefully) could lead to the wrong conclusion about the student’s understanding of the material.

In contrast, in an instructional context in which we are trying to teach students to parse what they read with care, there is value in subjecting students to low-risk reminders to read with care.

STYLE AND FORMAT CONCERNS

8. Format items vertically instead of horizontally.

If the options are brief, you could format them as a list of items, one beside the next. However, these lists are often harder to read and it is much harder to keep formatting consistent across a series of questions.

9. Edit items for clarity.

I improve the clarity of my test items in several ways:

  • I ask colleagues to review the items.
  • I coteach with other instructors or with teaching assistants. They take the test and discuss the items with me.
  • I encourage students to comment on test items. I use course management systems, so it is easy to set up a question-discussion forum for students to query, challenge or complain about test items.

In my experience, it is remarkable how many times an item can go through review (and improvement) and still be confusing.

10. Edit items for correct grammar, punctuation, capitalization and spelling.

It is common for instructors to write the stem and the correct choice together when they first write the question. The instructor words the distractors later, often less carefully and in some way that is inconsistent with the correct choice. These differences become undesirable clues about the right and wrong choices.

11. Simplify vocabulary so that reading comprehension does not interfere with testing the content intended.

There’s not much point asking a question that the examinee doesn’t understand. If the examinee doesn’t understand the technical terms (the words or concepts being tested), that’s one thing. But if the examinee doesn’t understand the other terms, the question simply won’t reach the examinee’s knowledge.

12. Minimize reading time. Avoid excessive verbiage.

Students whose first language is not English often have trouble with long questions.

13. Proofread each item.

Despite editorial care, remarkably many simple mistakes survive review or are introduced by mechanical error (e.g. cutting and pasting from a master list to the test itself).

WRITING THE STEM

14. Make the directions as clear as possible.

Consider the following confusingly-written question:

A program will accept a string of letters and digits into a password field. After it accepts the string, it asks for a comparison string, and on accepting a new input from the customer, it compares the first string against the second and rejects the password entry if the strings do not match.

  a. There are 218340105584896 possible tests of 8-character passwords.
  b. This method of password verification is subject to the risk of input-buffer overflow from an excessively long password entry.
  c. This specification is seriously ambiguous because it doesn’t tell us whether the program accepts or rejects/filters non-alphanumeric characters into the second password entry.

Let us pretend that each of these answers could be correct. Which is correct for this question? Is the stem calling for an analysis of the number of possible tests, the risks of the method, the quality of the specification, or something else?

The stem should make clear whether the question is looking for the best single answer or potentially more than one, and whether the question is asking for facts, opinion, examples, reasoning, a calculation, or something else.

The reader should never have to read the set of possible answers to understand what the question is asking.
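
As an aside, the large number in option (a) of the example above corresponds to 62 ** 8; that is, it appears to assume each of the 8 password characters can be any of the 62 alphanumeric characters (26 lowercase letters, 26 uppercase letters, and 10 digits):

```python
# 26 lowercase + 26 uppercase + 10 digits = 62 possible characters per position,
# so there are 62 ** 8 distinct 8-character alphanumeric passwords.
print(62 ** 8)  # 218340105584896, the figure quoted in option (a)
```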

15. Make the stem as brief as possible.

This is part of the same recommendation as Heuristic #12 above. If the entire question should be as short as possible (#12), the stem should be as short as possible.

However, “as short as possible” does not necessarily mean “short.”

Here are some examples:

  • The stem describes some aspect of the program in enough detail that it is possible to compute the number of possible software test cases. The choices include the correct answer and three miscalculations.
  • The stem describes a software development project in enough detail that the reader can see the possibility of doing a variety of tasks and the benefits they might offer to the project, and then asks the reader to prioritize some of the tasks. The choices are of the form, “X is more urgent than Y.”
  • The stem describes a potential error in the code, the types of visible symptoms that this error could cause, and then calls for selection of the best test technique for exposing this type of bug.
  • The stem quotes part of a product specification and then asks the reader to identify an ambiguity or to identify the most serious impact on test design an ambiguity like this might cause.
  • The stem describes a test, a failure exposed by the test, a stakeholder (who has certain concerns) who receives failure reports and is involved in decisions about the budget for the testing effort, and asks which description of the failure would be most likely to be perceived as significant by that stakeholder. An even more interesting question (faced frequently by testers in the real world) is which description would be perceived as significant (credible, worth reading and worth fixing) by Stakeholder 1 and which other description would be more persuasive for Stakeholder 2. (Someone concerned with next month’s sales might assess risk very differently from someone concerned with engineering / maintenance cost of a product line over a 5-year period. Both concerns are valid, but a good tester might raise different consequences of the same bug for the marketer than for the maintenance manager).

Another trend for writing test questions that address higher-level learning is to write a very long and detailed stem followed by several multiple choice questions based on the same scenario.

Long questions like these are fair game (normal cases) in exams for lawyers, such as the Multistate Bar Exam. They are looked on with less favor in disciplines that don’t demand the same level of skill in quickly reading/understanding complex blocks of text. Therefore, for many engineering exams (for example), questions like these are probably less popular.

  • They discriminate against people whose first language is not English and who are therefore slower readers of complex English text, or more generally against anyone who is a slow reader, because the exam is time-pressed.
  • They discriminate against people who understand the underlying material and who can reach an application of that material to real-life-complexity circumstances if they can work with a genuine situation or a realistic model (something they can appreciate in a hands-on way) but who are not so good at working from hypotheticals that abstract out all information that the examiner considers inessential.
  • They can cause a cascading failure. If the exam includes 10 questions based on one hypothetical and the examinee misunderstands that one hypothetical, she might blow all 10 questions.
  • They can demoralize an examinee who lacks confidence/skill with this type of question, resulting in a bad score because the examinee stops trying to do well on the test.

However, in a low-stakes exam without time limits, those concerns are less important. The exam becomes practice for this type of analysis, rather than punishment for not being good at it.

In software testing, we are constantly trying to simplify a complex product into testable lines of attack. We ignore most aspects of the product and design tests for a few aspects, considered on their own or in combination with each other. We build explicit or implicit mental models of the product under test, and work from those to the tests, and from the tests back to the models (to help us decide what the results should be). Therefore, drawing out the implications of a complex system is a survival skill for testers and questions of this style are entirely fair game–in a low stakes exam, designed to help the student learn, rather than a high-stakes exam designed to create consequences based on an estimate of what the student knows.

16. Place the main idea of the item in the stem, not in the choices.

Some instructors adopt an intentional style in which the stem is extremely short and the question is largely defined in the choices.

The confusingly-written question in Heuristic #14 was an example of a case in which the reader can’t tell what the question is asking until he reads the choices. In #14, there were two problems:

  • the stem didn’t state what question it was asking
  • the choices themselves were fundamentally different, asking about different dimensions of the situation described in the stem rather than exploring one dimension with a correct answer and distracting mistakes. The reader had to guess / decide which dimension was of interest as well as deciding which answer might be correct.

Suppose we fix the second problem but still have a stem so short that you don’t know what the question is asking for until you read the options. That’s the issue addressed here (Heuristic #16).

For example, here is a better-written question that doesn’t pass muster under Heuristic #16:

A software oracle:

  1. is defined this way
  2. is defined this other way
  3. is defined this other way

The better question under this heuristic would be:

What is the definition of a software oracle?

  1. this definition
  2. this other definition
  3. this other other definition

As long as the options are strictly parallel (they are alternative answers to the same implied question), I don’t think this is a serious problem.

17. Avoid irrelevant information (window dressing).

Imagine a question that includes several types of information in its description of some aspect of a computer program:

  • details about how the program was written
  • details about how the program will be used
  • details about the stakeholders who are funding or authorizing the project
  • details about ways in which products like this have failed before

All of these details might be relevant to the question, but probably most of them are not relevant to any particular question. For example, to calculate the theoretically-possible number of tests of part of the program doesn’t require any knowledge of the stakeholders.

Information in the stem is irrelevant if you don’t need it to determine which option is the correct answer, unless the reader’s ability to wade through irrelevant information of this type in order to get to the right underlying formula (or, more generally, the right approach to the problem) is part of the skill that the question is intended to test.

18. Avoid negative words in the stem.

Here are some examples of stems with negative structure:

  • Which of the following is NOT a common definition of software testing?
  • Do NOT assign a priority to a bug report EXCEPT under what condition(s)?
  • You should generally compute code coverage statistics UNLESS:

For many people, these are harder than questions that ask for the same information in a positively-phrased way.

There is some evidence that there are cross-cultural variations. That is, these questions are harder for some people than others because (probably) of their original language training in childhood. Therefore, a bad result on this question might have more to do with the person’s heritage than with their knowledge or skill in software testing.

However, the ability to parse complex logical expressions is an important skill for a tester. Programmers make lots of bugs when they write code to implement things like:

NOT (A OR B) AND C

So testers have to be able to design tests that anticipate the bug and check whether the programmer made it.
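
As an illustration, a tester can enumerate the truth table to find exactly which inputs distinguish the intended expression from a plausible coding mistake. The “buggy” variant below is hypothetical, chosen only to show the idea:

```python
from itertools import product

def intended(a, b, c):
    return (not (a or b)) and c

def buggy(a, b, c):
    # Hypothetical mistake: the negation is applied to A alone.
    return ((not a) or b) and c

# The inputs where the two expressions disagree are the tests that would catch this bug.
for a, b, c in product([False, True], repeat=3):
    if intended(a, b, c) != buggy(a, b, c):
        print((a, b, c))  # prints (False, True, True) and (True, True, True)
```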

It is not unfair to ask a tester to handle some complex negation, if your intent is to test whether the tester can work with complex logical expressions. But if you think you are testing something else, and your question demands careful logic processing, you won’t know from a bad answer whether the problem was the content you thought you were testing or the logic that you didn’t consider.

Another problem is that many people read negative sentences as positive. Their eyes glaze over when they see the NOT and they answer the question as if it were positive (Which of the following IS a common definition of software testing?). Unless you are testing for glazed eyes, you should make the negation as visible as possible. I use ITALICIZED ALL-CAPS BOLDFACE in the examples above.

WRITING THE CHOICES (THE OPTIONS)

19. Develop as many effective options as you can, but two or three may be sufficient.

Imagine an exam with 100 questions. All of them have two options. Someone who is randomly guessing should get 50% correct.

Now imagine an exam with 100 questions that all have four options. Under random guessing, the examinee should get 25%.
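
A minimal sketch of that arithmetic (the numbers are the same ones used in the two paragraphs above):

```python
# Expected score under pure random guessing: each question contributes 1/k of a point
# on average, where k is the number of (effective) options.
questions = 100
for k in (2, 4):
    print(f"{k} options per question: expected score = {questions / k:.0f}%")
# 2 options per question: expected score = 50%
# 4 options per question: expected score = 25%
```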

The issue of effectiveness is important because an answer that is not credible (not effective) won’t gain any guesses. For example, imagine that you saw this question on a quiz in a software testing course:

Green-box testing is:

  1. common at box manufacturers when they start preparing for the Energy Star rating
  2. a rarely-taught style of software testing
  3. a nickname used by automobile manufacturers for tests of hybrid cars
  4. the name of Glen Myers’ favorite book

I suspect that most students would pick choice 2 because 1 and 3 are irrelevant to the course and 4 is ridiculous (if it was a proper name, for example, “Green-box testing” would be capitalized.) So even though there appear to be 4 choices, there is really only 1 effective one.

The number of choices is important, as is the correction-for-guessing penalty, if you are using multiple choice test results to assign a grade or assess the student’s knowledge in a way that carries consequences for the student.

The number of choices, and the final score itself, are much less important if the quiz is for learning support rather than for assessment.

The Open Certification exam is for assessment and has a final score, but it is different from other exams in that examinees can review the questions and consider the answers in advance. Statistical theories of scoring just don’t apply well under those conditions.

20. Vary the location of the right answer according to the number of options. Assign the position of the right answer randomly.

There’s an old rule of thumb–if you don’t know the answer, choose the second one in the list. Some inexperienced exam-writers tend to put the correct answer in the same location more often than if they varied location randomly. Experienced exam-writers use a randomization method to eliminate this bias.
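
Here is a minimal sketch of one such randomization method. The function and its data layout are illustrative only (they are not part of any exam software); the point is that the position of the correct answer comes from the random number generator, not from the author’s habits:

```python
import random

def shuffle_options(correct, distractors, rng=None):
    """Return the options in random order plus the new index of the correct answer."""
    rng = rng or random.Random()
    options = [correct] + list(distractors)
    rng.shuffle(options)
    return options, options.index(correct)

# Example using Reference Example B's choices:
options, key_position = shuffle_options(
    "The plan for applying resources and selecting techniques to achieve the testing mission.",
    ["The plan for applying resources and selecting techniques to assure quality.",
     "The guiding plan for finding bugs."])
```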

21. Place options in logical or numerical order.

The example that Haladyna gives is numeric. If you’re going to ask the examinee to choose the right number from a list of choices, then present them in order (like $5, $10, $20, $175) rather than randomly (like $20, $5, $175, $10).

In general, the idea underlying this heuristic is that the reader is less likely to make an accidental error (one unrelated to their knowledge of the subject under test) if the choices are ordered and formatted in the way that makes them as easy as possible to read quickly and understand correctly.

22. Keep options independent; choices should not be overlapping.

Assuming standard productivity metrics, how long should it take to create and document 100 boundary tests of simple input fields?

  1. 1 hour or less
  2. 5 hours or less
  3. between 3 and 7 hours
  4. more than 6 hours

These choices overlap. If you think the correct answer is 4 hours, which one do you pick as the correct answer?
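
One way to see the overlap is to treat each numeric option as an interval and check which intervals contain a given estimate. The interval boundaries below are my reading of the options above; the check itself is only an illustration:

```python
def covers(interval, value):
    low, high = interval
    return low <= value and (high is None or value <= high)

options = {
    "1": (0, 1),     # "1 hour or less"
    "2": (0, 5),     # "5 hours or less"
    "3": (3, 7),     # "between 3 and 7 hours"
    "4": (6, None),  # "more than 6 hours" (open-ended)
}

# An estimate of 4 hours falls inside more than one option, so the choices are not independent.
print([label for label, interval in options.items() if covers(interval, 4)])  # ['2', '3']
```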

Here is a style of question that I sometimes use that might look overlapping at first glance, but is not:

What is the best course of action in context C?

  1. Do X because of RY (the reason you should do Y).
  2. Do X because of RX (the reason you should do X, but a reason that the examinee is expected to know is impossible in context C)
  3. Do Y because of RY (the correct answer)
  4. Do Y because of RX

Two options tell you to do Y (the right thing to do), but for different reasons. One reason is appropriate, the other is not. The test is checking not just whether the examinee can decide what to do but whether she can correctly identify why to do it. This can be a hard question but if you expect a student to know why to do something, requiring them to pick the right reason as well as the right result is entirely fair.

23. Keep the options homogeneous in content and grammatical structure.

Inexperienced exam writers often accidentally introduce variation between the correct answer and the others. For example, the correct answer:

  • might be properly punctuated
  • might start with a capital letter (or not start with one) unlike the others
  • might end with a period or semi-colon (unlike the others)
  • might be present tense (the others in past tense)
  • might be active voice (the others in passive voice), etc.

The most common reason for this is that some exam authors write a long list of stems and correct answers, then fill the rest of the questions in later.

The nasty, sneaky, tricky exam writer knows that test-wise students look for this type of variation and so introduces it deliberately:

Which is the right answer?

  1. this is the right answer
  2. This is the better-formatted second-best answer.
  3. this is a wrong answer
  4. this is another wrong answer

The test-savvy guesser will be drawn to answer 2 (bwaa-haaa-haa!)

Tricks are one way to keep down the scores of skilled guessers, but when students realize that you’re hitting them with trick questions, you can lose your credibility with them.

24. Keep the length of options about the same.

Which is the right answer?

  1. this is the wrong answer
  2. This is a really well-qualified and precisely-stated answer that is obviously more carefully considered than the others, so which one do you think is likely to be the right answer?
  3. this is a wrong answer
  4. this is another wrong answer

25. “None of the above” should be used sparingly.

As Haladyna points out, there is a fair bit of controversy over this heuristic:

  • If you use it, make sure that you make it the correct answer sometimes and the incorrect answer sometimes
  • Use it when you are trying to make the student actually solve a problem and assess the reasonability of the possible solutions

26. Avoid using “all of the above.”

The main argument against “all of the above” is that if there is an obviously incorrect option, then “all of the above” is obviously incorrect too. Thus, test-wise examinees can reduce the number of plausible options easily. If you are trying to statistically model the difficulty of the exam, or create correction factors (a “correction” is a penalty for guessing the wrong answer), then including an option that is obviously easier than the others makes the modeling messier.
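
For reference, one classical correction-for-guessing formula (not something we use in these courses) subtracts a fraction of the wrong answers. A sketch, assuming k options per question:

```python
def corrected_score(num_right, num_wrong, options_per_question):
    """Classical correction for guessing: R - W / (k - 1)."""
    return num_right - num_wrong / (options_per_question - 1)

# A student with 70 right and 30 wrong on a 4-option exam:
print(corrected_score(70, 30, 4))  # 60.0
```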

In our context, we aren’t “correcting” for guessing or estimating the difficulty of the exam:

  • In the BBST (open book) exam, the goal is to get the student to read the material carefully and think about it. Difficulty of the question is more a function of difficulty of the source material than of the question.
  • In the Open Certification exam, every question appears on a public server, along with a justification of the intended-correct answer and public commentary. Any examinee can review these questions and discussions. Some will, some won’t; some will remember what they read and some won’t; some will understand what they read and some won’t. How do you model the difficulty of questions this way? Whatever the models might be, the fact that the “all of the above” option is relatively easy for some students who have to guess is probably a minor factor.

Another argument is more general. Several authors, including Haladyna, Downing, & Rodriguez (2002), recommend against the complex question that allows more than one correct answer. This makes the question more difficult and more confusing for some students.

Even though some authors recommend against it, our question construction adopts a complex structure that allows selection of combinations (such as “(a) and (b)” as well as “all of the above”) because other educational researchers consider this structure a useful vehicle for presenting difficult questions in a fair way. See, for example, Wongwiwatthananukit, Popovich & Bennett (2000) and their references.

Note that in the BBST / Open Certification structure, the fact that there is a combination choice or an all of the above choice is not informative because most questions have these.

There is a particular difficulty with this structure, however. Consider this question:

Choose the answer:

  a. This is the best choice
  b. This is a bad choice
  c. This is a reasonable answer, but (a) is far better–or this is really a subset of (a), weak on its own but it would be the only correct one if (a) was not present.
  d. (a) and (b)
  e. (a) and (c)
  f. (b) and (c)
  g. (a) and (b) and (c)

In this case, the student will have an unfairly hard time choosing between (a) and (e). We have created questions like this accidentally, but when we recognize this problem, we fix it in one of these ways:

Alternative 1. Choose the answer:

    a. This is the best choice
    b. This is a bad choice
    c. This is a reasonable answer, but (a) is far better–or this is really a subset of (a), weak on its own but it would be the only correct one if (a) was not present.
    d. This is a bad choice
    e. (a) and (b)
    f. (b) and (c)
    g. (a) and (b) and (c)

    In this case, we make sure that (a) and (c) is not available for selection.

Alternative 2. Choose the answer:

    a. This is the best choice
    b. This is a bad choice
    c. This is a reasonable answer, but (a) is far better–or this is really a subset of (a), weak on its own but it would be the only correct one if (a) was not present.
    d. This is a bad choice

    In this case, no combinations are available for selection.

27. Avoid negative words such as not or except.

This is the same advice, for the options, as we provided in Heuristic #18 for the stem, for the same reasons.

28. Avoid options that give clues to the right answer.

Some of the mistakes mentioned by Haladyna, Downing, & Rodriguez (2002) are:

  • Broad assertions that are probably incorrect, such as always, never, must, and absolutely.
  • Choices that sound like words in the stem, or words that sound like the correct answer
  • Grammatical inconsistencies, length inconsistencies, formatting inconsistencies, extra qualifiers or other obvious inconsistencies that point to the correct choice
  • Pairs or triplet options that point to the correct choice. For example, if every combination option includes (a) (such as “(a) and (b)”, “(a) and (c)”, and “all of the above”), then it is pretty obvious that (a) is probably correct and any answer that excludes (a) (such as (b)) is probably wrong.

29. Make all distractors plausible.

This is important for two reasons:

  • If you are trying to do statistical modeling of the difficulty of the exam (“There are 4 choices in this question, therefore there is only a 25% chance of a correct answer from guessing”) then implausible distractors invalidate the model because few people will make this guess. However, in our tests, we aren’t doing this modeling so this doesn’t matter.
  • An implausible choice is a waste of space and time. If no one will make this choice, it is not really a choice. It is just extra text to read.

One reason that an implausible distractor is sometimes valuable is that sometimes students do pick obviously unreasonable distractors. In my experience, this happens when the student is:

  • ill, and not able to concentrate
  • falling asleep, and not able to concentrate
  • on drugs or drunk, and not able to concentrate, or temporarily afflicted with a very strange sense of humor
  • copying answers (in a typical classroom test, looking at someone else’s exam a few feet away) and making a copying mistake.

I rarely design test questions with the intent of including a blatantly implausible option, but I am an inept enough test-writer that a few slip by anyway. These aren’t very interesting in the BBST course, but I have found them very useful in traditional quizzes in the traditionally-taught university course.

30. Use typical errors of students when you write distractors.

Suppose that you gave a fill-in-the-blank question to students. In this case, for example, you might ask the student to tell you the definition rather than giving students a list of definitions to choose from. If you gathered a large enough sample of fill-in-the-blank answers, you would know what the most common mistakes are. Then, when you create the multiple choice question, you can include these as distractors. The students who don’t know the right answer are likely to fall into one of the frequently-used wrong answers.
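
A minimal sketch of that tallying step, assuming the free-text responses have been collected into a list of strings (the function and its crude normalization are my own illustration):

```python
from collections import Counter

def common_wrong_answers(responses, correct_answer, top_n=3):
    """Tally the most frequent wrong answers; these become candidate distractors."""
    normalize = lambda s: s.strip().lower()
    wrong = [normalize(r) for r in responses if normalize(r) != normalize(correct_answer)]
    return Counter(wrong).most_common(top_n)

# Hypothetical usage:
# common_wrong_answers(student_responses, "a mechanism for deciding whether a test passed or failed")
```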

I rarely have the opportunity to build questions this way, but the principle carries over. When I write a question, I ask “If someone was going to make a mistake, what mistake would they make?”

31. Use humor if it is compatible with the teacher; avoid humor in a high-stakes test.

Robert F. McMorris, Roger A. Boothroyd, & Debra J. Pietrangelo (1997) and Powers (2005) advocate for carefully controlled use of humor in tests and quizzes. I think this is reasonable in face-to-face instruction, once the students have come to know the instructor (or in a low-stakes test while students are getting to know the instructor). However, in a test that involves students from several cultures, who have varying degrees of experience with the English language, I think humor in a quiz can create more confusion and irritation than it is worth.

References

These notes summarize lessons that came out of the last Workshop on Open Certification (WOC 2007) and from private discussions related to BBST.

There’s a lot of excellent advice on writing multiple-choice test questions. Here are a few sources that I’ve found particularly helpful:

  1. Lorin Anderson, David Krathwohl, & Benjamin Bloom, A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives, Complete Edition, Longman Publishing, 2000.
  2. National Conference of Bar Examiners, Multistate Bar Examination Study Aids and Information Guides.
  3. Steven J. Burton, Richard R. Sudweeks, Paul F. Merrill, Bud Wood, How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty, Brigham Young University Testing Services, 1991.
  4. Thomas M. Haladyna, Writing Test Items to Evaluate Higher Order Thinking, Allyn & Bacon, 1997.
  5. Thomas M. Haladyna, Developing and Validating Multiple-Choice Test Items, 3rd Edition, Lawrence Erlbaum, 2004.
  6. Thomas M. Haladyna, Steven M. Downing, Michael C. Rodriguez, A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment, Applied Measurement in Education, 15(3), 309–334, 2002.
  7. Robert F. McMorris, Roger A. Boothroyd, & Debra J. Pietrangelo, Humor in Educational Testing: A Review and Discussion, Applied Measurement in Education, 10(3), 269–297, 1997.
  8. Ted Powers, Engaging Students with Humor, Association for Psychological Science Observer, 18(12), December 2005.
  9. The Royal College of Physicians and Surgeons of Canada, Developing Multiple Choice Questions for the RCPSC Certification Examinations.
  10. Supakit Wongwiwatthananukit, Nicholas G. Popovich, & Deborah E. Bennett, Assessing pharmacy student knowledge on multiple-choice examinations using partial-credit scoring of combined-response multiple-choice items, American Journal of Pharmaceutical Education, Spring, 2000.
  11. Bibliography and links on Multiple Choice Questions at http://ahe.cqu.edu.au/MCQ.htm

References to my blogs

7th Workshop on Teaching Software Testing, January 18-20, 2008

Saturday, October 13th, 2007

This year’s Workshop on Teaching Software Testing (WTST) will be January 18-20 in Melbourne, Florida.

WTST is concerned with the practical aspects of teaching university-caliber software testing courses to academic or commercial students.

This year, we are particularly interested in teaching testing online. How can we help students develop testing skills and foster higher-order thinking in online courses?

We invite participation by:

  • academics who have experience teaching testing courses
  • practitioners who teach professional seminars on software testing
  • academic or practitioner instructors with significant online teaching experience and wisdom
  • one or two graduate students
  • a few seasoned teachers or testers who are beginning to build their strengths in teaching software testing.

There is no fee to attend this meeting. You pay for your seat through the value of your participation. Participation in the workshop is by invitation based on a proposal. We expect to accept 15 participants with an absolute upper bound of 25.

WTST is a workshop, not a typical conference. Our presentations serve to drive discussion. The target readers of workshop papers are the other participants, not archival readers. We are glad to start from already-published papers, if they are presented by the author and they would serve as a strong focus for valuable discussion.

In a typical presentation, the presenter speaks 10 to 90 minutes, followed by discussion. There is no fixed time for discussion. Past sessions’ discussions have run from 1 minute to 3 hours. During the discussion, a participant might ask the presenter simple or detailed questions, describe consistent or contrary experiences or data, present a different approach to the same problem, or (respectfully and collegially) argue with the presenter. In 20 hours of formal sessions, we expect to cover six to eight presentations.

We also have lightning presentations, time-limited to 5 minutes (plus discussion). These are fun and they often stimulate extended discussions over lunch and at night.

Presenters must provide materials that they share with the workshop under a Creative Commons license, allowing reuse by other teachers. Such materials will be posted at http://www.wtst.org.

SUGGESTED TOPICS

There are few courses in software testing, but a large percentage of software engineering practitioners do test-related work as their main focus. Many of the available courses, academic and commercial, attempt to cover so much material that they are superficial and therefore ineffective for improving students’ skills or their ability to analyze and address problems of real-life complexity. Online courses might, potentially, be a vehicle for providing excellent educational opportunities to a diverse pool of students.

Here are examples of ideas that might help us learn more about providing testing education online in ways that realize this potential:

  • Instructive examples: Have you tried teaching testing online? Can you show us some of what you did? What worked? What didn’t? Why? What can we learn from your experience?
  • Instructive examples from other domains: Have you tried teaching something else online and learned lessons that would be applicable to teaching testing? Can you build a bridge from your experience to testing?
  • Instructional techniques, for online instruction, that help students develop skill, insight, appreciation of models and modeling, or other higher-level knowledge of the field. Can you help us see how these apply to testing-related instruction?
  • Test-related topics that seem particularly well-suited to online instruction: Do you have a reasoned, detailed conjecture about how to bring a topic online effectively? Would a workshop discussion help you develop your ideas further? Would it help the other participants understand what can work online and how to make it happen?
  • Lessons learned teaching software testing: Do you have experiences from traditional teaching that seem general enough to apply well to the online environment?
  • Moving from Face-to-Face to Online Instruction – How does one turn a face-to-face class into an effective online class? What works? What needs to change?
  • Digital Backpack – Students and instructors bring a variety of tools and technologies to today’s fully online or web-enhanced classroom. Which tools do today’s teachers need? How can those tools be used? What about students?
  • The Scholarship of Teaching and Learning – How does one research one’s own teaching? What methods capture improved teaching and learning or reveal areas needing improvement? How can this work be published in ways that meet promotion and tenure requirements?
  • Qualitative Methods – From sloppy anecdotal reports to rigorous qualitative design. How can we use qualitative methods to conduct research on the teaching of computing, including software testing?

TO ATTEND AS A PRESENTER

Please send a proposal BY DECEMBER 1, 2007 to Cem Kaner that identifies who you are, what your background is, what you would like to present, how long the presentation will take, any special equipment needs, and what written materials you will provide. Along with traditional presentations, we will gladly consider proposed activities and interactive demonstrations.

We will begin reviewing proposals on November 1. We encourage early submissions. It is unlikely but possible that we will have accepted a full set of presentation proposals by December 1.

Proposals should be between two and four pages long, in PDF format. We will post accepted proposals to http://www.wtst.org.

We review proposals in terms of their contribution to knowledge of HOW TO TEACH software testing. Proposals that present a purely theoretical advance in software testing, with weak ties to teaching and application, will not be accepted. Presentations that reiterate materials you have presented elsewhere might be welcome, but it is imperative that you identify the publication history of such work.

By submitting your proposal, you agree that, if we accept your proposal, you will submit a scholarly paper for discussion at the workshop by January 7, 2008. Workshop papers may be of any length and follow any standard scholarly style. We will post these at http://www.wtst.org as they are received, for workshop participants to review before the workshop.

TO ATTEND AS A NON-PRESENTING PARTICIPANT:

Please send a message BY DECEMBER 1, 2007, to Cem Kaner that describes your background and interest in teaching software testing. What skills or knowledge do you bring to the meeting that would be of interest to the other participants?

ADVISORY BOARD MEETING

Florida Tech’s Center for Software Testing Education & Research has been developing a collection of hybrid and online course materials for teaching black box software testing. We now have NSF funding to adapt these materials for implementation by a broader audience. We are forming an Advisory Board to guide this adaptation and the associated research on the effectiveness of the materials in diverse contexts. The Board will meet before WTST, on January 17, 2008. If you are interested in joining the Board and attending the January meeting, please read this invitation and submit an application.

Acknowledgements

Support for this meeting comes from the Association for Software Testing and Florida Institute of Technology.

The hosts of the meeting are:

Research Funding and Advisory Board for the Black Box Software Testing (BBST) Course

Friday, October 12th, 2007

Summary: With some new NSF funding, we are researching and revising BBST to make it more available and more useful to more people around the world. The course materials will continue to be available for free. If you are interested in joining an advisory board that helps us set direction for the course and the research surrounding the course, please contact me, describing your background in software-testing-related education, in education-related research, and your reason(s) for wanting to join the Board.

Starting with a joint project with Hung Quoc Nguyen in 1993, I’ve developed a broad set of course materials for black box software testing. The National Science Foundation approved a project (EIA-0113539 ITR/SY+PE “Improving the Education of Software Testers”) that evolved my commercial-audience course materials for an academic audience and researched learning issues associated with testing. The resulting course materials are at http://www.testingeducation.org/BBST, with lots of papers at http://www.testingeducation.org/articles and https://kaner.com/?page_id=7. The course materials are available for everyone’s use, for free, under a Creative Commons license.

During that research, I teamed up with Rebecca Fiedler, an experienced teacher (now an Assistant Professor of Education at St. Mary-of-the-Woods College in Terre Haute, Indiana, and also now my wife). The course that Rebecca and I evolved turned traditional course design inside out in order to encourage students’ involvement, skill development, and critical thinking. Rather than using class time for lectures and students’ private time for activities (labs, assignments, debates, etc.), we videotaped the lectures and required students to watch them before coming to class. We used class time for coached activities centered more on the students than the professor.

This looked like a pretty good teaching approach, our students liked it, and the National Science Foundation funded a project to extend this approach to developing course materials on software engineering ethics in 2006. (If you would like to collaborate with us on this project, or if you are a law student interested in a paid research internship, contact Cem Kaner.)

Recently, the National Science Foundation approved Dr. Fiedler’s and my project to improve the BBST course itself, “Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing.” With funding running from October 1, 2007 through 2010, our primary goals are:

  • develop and sustain a cadre of academic, in-house, and commercial instructors via:
    • creating and offering an instructor orientation course online;
    • establishing an ongoing online instructors’ forum; and
    • hosting a number of face-to-face instructor meetings;
  • offer and evaluate the course at collaborating research sites (including both universities and businesses);
  • analyze several collections of in-class activities to abstract a set of themes / patterns that can help instructors quickly create new activities as needed; and
  • extend instructional support material including grading guides and a pool of exam questions for teaching the course.

All of our materials—such as videos, slides, exams, grading guides, and instructor manuals—are Creative Commons licensed. Most are available freely to the public. A few items designed to help instructors grade student work will be available at no charge, but only to instructors.

Several individuals and organizations have agreed to collaborate in this work, including:

  • AppLabs Technologies. Representatives: Geetha Narayanan, CSQA, PMP; Shyam Sunder Depuru.
  • Aztechsoft. Representative: Ajay Bhagwat.
  • The Association for Software Testing. Representative: Michael Kelly, President.
    • AST is breaking the course into several focused, online mini-courses that run 1 month each. The courses are offered, for free, to AST members. AST is starting its second teaching of the Foundations course this week. We’ll teach Bug Advocacy in a month. As we develop these courses, we are training instructors who, after sufficient training, will teach the course(s) they are trained to teach for AST (free courses) as well as at their school or company (for free or fee, as they choose).
  • Dalhousie University. Representative: Professor Morven Gentleman.
  • Huston-Tillotson University, Computer Science Department. Representative: Allen M. Johnson, Jr., Ph.D.
  • Microsoft. Representative: Marianne Guntow.
  • PerfTest Plus. Representative: Scott Barber.
  • Quardev Laboratories. Representative: Jonathan Bach.
  • University of Illinois at Springfield, Computer Sciences Program. Representative: Dr. Keith W. Miller.
  • University of Latvia. Representative: Professor Juris Borzovs.

If you would like to collaborate on this project as well:

  1. Please read our research proposal.
  2. Please consider your ability to make a financial commitment. We are not asking for donations (well, of course, we would love to get donations, but they are not required), but you or your company would have to absorb the cost of travel to Advisory Board meetings, and you would probably come to the Workshop on Teaching Software Testing and/or the Conference of the Association for Software Testing. Additionally, teaching the course at your organization and collecting the relevant data would be at your expense. (My consultation to you on this teaching would be free, but if you needed me to fly to your site, that would be at your expense and might involve a fee.) We have a little NSF money ($15,000 total for the three years) to subsidize travel to Advisory Board meetings, so we can offset travel costs to a small degree, but the funds are very limited, and especially little is available for corporations.
  3. Please consider your involvement. What do you want to do?
    • Join the advisory board, help guide the project?
    • Collaborate on the project as a fellow instructor (and get instructor training)?
    • Come to the Workshop on Teaching Software Testing?
    • Help develop a Body of Knowledge to support the course materials?
    • Participate as a lecturer or on-camera discussant on the video courses?
    • Other stuff, such as …???
  4. Send me a note that covers items 1-3, introduces you, and describes your background and interest.

The first meeting of the Advisory Board is January 17, 2008, in Melbourne, Florida. We will host the Workshop on Teaching Software Testing (WTST 2008) from January 18-20. I’ll post a Call for Participation for WTST 2008 on this blog tomorrow.