Archive for the ‘education’ Category

A few new articles

Saturday, December 26th, 2009

I finally found some time to update my website, posting links to some more of my papers and presentations.

There are a few themes:

  • Investment modeling as a new exemplar. Software testing helps us understand the quality of the product or service under test. There are generically useful approaches to test design, like quicktests and tours and other basic techniques, but I think we add our greatest value when we apply a deeper understanding of the application. Testing instructors don’t teach this well because it takes so long to build a classroom-wide understanding of an application that is complex enough to be interesting. For the last 14 months, I’ve been exploring investment modeling as a potential exemplar of deep and interesting testing within a field that many people can grasp quickly.
  • Exploratory test automation. I don’t understand why people say that exploratory testing is always manual testing. Doug Hoffman and I have been teaching “high-volume” test techniques for twelve years (I wrote about some of these back in Testing Computer Software) that don’t involve regression testing or test-by-test scripting. We run these to explore new risks; we change our parameters to shift the focus of our search (to try something new or to go further in depth if we’re onto something interesting). This is clearly exploratory, but it is intensely automated. I’m now using investment modeling to illustrate this, and starting to work with Scott Barber to use performance modeling to illustrate it as well. Doug is working through a lot of historical approaches; perhaps the three of us can integrate our work, along with a lot of interesting work published by other folks, into something that more clearly conveys the general idea.
  • Instructional design: Teaching software testing. Rebecca Fiedler, Scott Barber and I have worked through a model for online education in software testing that fosters deeper learning than many other approaches. The Association for Software Testing has been a major testbed for this approach. We’ve also been doing a lot in academic institutions, comparing notes in detail with faculty at other schools.
  • The evolving law of software quality. Federal and state legislatures have failed to adopt laws governing software contracting and software quality. Because of this, American judges have had to figure out for themselves what legal rules should be applied–until Congress or the state legislatures finally get around to giving clear and constitutional guidance to the courts. This spring, the American Law Institute unanimously adopted the Principles of the Law of Software Contracts, which includes some positions that I’ve been advocating for 15 years. The set of papers below includes some discussion of the Principles. In addition, I’m kicking off a wiki-based project to update my book, Bad Software, to give customers good advice about their rights and their best negotiating tactics under the current legal regime. I’ll blog more about this later, looking for volunteers to help update the book.

Here’s the list of new stuff:

  1. Cem Kaner, “Exploratory test automation: Investment modeling as an example.” [SLIDES]. ImmuneIT, Amsterdam, October 2009.
  2. Cem Kaner, “Investment modeling: A software engineer’s approach.” [SLIDES]. Colloquium, Florida Institute of Technology, October 2009.
  3. Cem Kaner, “Challenges in the Evolution of Software Testing Practices in Mission-Critical Environments.” [SLIDES]. Software Test & Evaluation Summit/Workshop (National Defense Industrial Association), Reston VA, September 2009.
  4. Cem Kaner, “Approaches to test automation.” [SLIDES]. Research in Motion, Kitchener/Waterloo, September 2009.
  5. Cem Kaner, “Software Testing as a Quality-Improvement Activity” [SLIDES]. Lockheed Martin / IEEE Computer Society Webinar Series, September 2009.
  6. Rebecca L. Fiedler & Cem Kaner, “Putting the context in context-driven testing (an application of Cultural Historical Activity Theory)” [SLIDES]. Conference of the Association for Software Testing. Colorado Springs, CO., July 2009.
  7. Cem Kaner, “Metrics, qualitative measurement, and stakeholder value” [SLIDES]. Tutorial, Conference of the Association for Software Testing. Colorado Springs, CO., July 2009.
  8. Cem Kaner, “The value of checklists and the danger of scripts: What legal training suggests for testers.” [SLIDES]. Conference of the Association for Software Testing. Colorado Springs, CO., July 2009.
  9. Cem Kaner, “New rules adopted for software contracts.” [SLIDES]. Conference of the Association for Software Testing. Colorado Springs, CO., July 2009.
  10. Cem Kaner, “Activities in software testing education: a structure for mapping learning objectives to activity designs”. Software Testing Education Workshop (International Conference on Software Testing), Denver, CO, April 2009.
  11. Cem Kaner, “Plagiarism-detection software: Clashing intellectual property rights and aggressive vendors yield dismaying results.” [SLIDES] [VIDEO]. Colloquium, Florida Institute of Technology, October 2009.
  12. Cem Kaner, “Thinking about the Software Testing Curriculum.” [SLIDES]. Workshop on Integrating Software Testing into Programming Courses, Florida International University, March 2009.
  13. Cem Kaner (initial draft), “Dimensions of Excellence in Research”. Department of Computer Sciences, Florida Institute of Technology, Spring 2009.
  14. Cem Kaner, “Patterns of activities, exercises and assignments.” [SLIDES]. Workshop on Teaching Software Testing, Melbourne FL, January 2009.
  15. Cem Kaner & Rebecca L. Fiedler, “Developing instructor-coached activities for hybrid and online courses.” [SLIDES]. Workshop at Inventions & Impact 2: Building Excellence in Undergraduate Science, Technology, Engineering & Mathematics (STEM) Education, National Science Foundation / American Association for the Advancement of Science, Washington DC, August 2008.
  16. Cem Kaner, Rebecca L. Fiedler, & Scott Barber, “Building a free courseware community around an online software testing curriculum.” [SLIDES]. Poster Session at Inventions & Impact 2: Building Excellence in Undergraduate Science, Technology, Engineering & Mathematics (STEM) Education, National Science Foundation / American Association for the Advancement of Science, Washington DC, August 2008.
  17. Cem Kaner, Rebecca L. Fiedler, & Scott Barber, “Building a free courseware community around an online software testing curriculum.” [SLIDES]. MERLOT conference, Minneapolis, August 2008.
  18. Cem Kaner, “Authentic assignments that foster student communication skills” [SLIDES], Teaching Communication Skills in the Software Engineering Curriculum: A Forum for Professionals and Educators (NSF Award #0722231), Miami University, Ohio, June 2008.
  19. Cem Kaner, “Comments on the August 31, 2007 Draft of the Voluntary Voting System Guidelines.” Submitted to the United States Election Assistance Commission, May 2008.
  20. Cem Kaner and Rebecca L. Fiedler, “A cautionary note on checking software engineering papers for plagiarism.” IEEE Transactions on Education, vol. 51, issue 2, 2008, pp. 184-188.
  21. Cem Kaner, “Software testing as a social science,” [SLIDES] STEP 2000 Workshop on Software Testing, Memphis, May 2008.
  22. Cem Kaner & Stephen J. Swenson, “Good enough V&V for simulations: Some possibly helpful thoughts from the law & ethics of commercial software.” [SLIDES] Simulation Interoperability Workshop, Providence, RI, April 2008.
  23. Cem Kaner, “Improve the power of your tests with risk-based test design.” [SLIDES] QAI QUEST Conference, Chicago, April 2008
  24. Cem Kaner, “Risk-based testing: Some basic concepts.” [SLIDES] QAI Managers Workshop, QUEST Conference, Chicago, April 2008
  25. Cem Kaner, “A tutorial in exploratory testing.” [SLIDES] QAI QUEST Conference, Chicago, April 2008
  26. Cem Kaner, “Adapting Academic Course Materials in Software Testing for Industrial Professional Development.” [SLIDES] Colloquium, Florida Institute of Technology, March 2008
  27. Cem Kaner, “BBST at AST: Adaptation of a course in black box software testing.” [SLIDES]. Workshop on Teaching Software Testing, Melbourne FL, January 2008.
  28. Cem Kaner, “BBST: Evolving a course in black box software testing.” [SLIDES] BBST Project Advisory Board Meeting, January 2008

AST Instructors’ Tutorial at CAST in Toronto

Wednesday, May 28th, 2008

You’ve read about the Association for Software Testing’s free software testing courses. Now find out how you can get involved in teaching these for AST, for your company, or independently. This workshop will use presentations, lectures, and hands-on exercises to address the challenges of teaching online. Becky Fiedler, Scott Barber and I will host the Live! AST Instructors Orientation Course Jumpstart Tutorial on July 17, 2008, in conjunction with this year’s Conference of the AST (CAST).

To register for this Tutorial, go to

To register for CAST (early registration ends June 1)  go to

More Details

AST is developing a series of online courses, free to our members, each taught by a team of volunteer instructors. AST will grant a certificate in software testing to members who have completed 10 courses. (We hope to develop our 10th course by July 2009.)

AST trains and certifies instructors: The main requirements for an AST member to be certified to teach an AST course are (a) completing the AST course on teaching online, (b) teaching the same course three times under supervision, (c) approval by the currently-certified instructors for that course, and (d) agreement to teach the course a few times for AST (for free).

As a certified instructor, you can offer the course for AST credit: for AST (for free), for your company, or on your own. You or your company can charge fees without paying AST any royalties or other fees. (AST can only offer each free course a few times per year–if demand outstrips that supply, instructors will have a business opportunity to fill that gap.)

This Tutorial, the day after CAST, satisfies the Instructors Orientation Course requirement for prospective AST-certified instructors.

This workshop will use presentations, lectures, and hands-on exercises to address the challenges of teaching online. (Bring your laptop and wireless card if you can.) The presenters will merge instructional theory and assessment theory to show you how they developed the AST-BBST online instructional model. Over lunch, Scott Barber will lead a panel discussion of AST members who are working on AST Instructor Certification.

Your registration includes a boxed lunch and light snacks in the morning and afternoon.

This workshop is partially based on research that was supported by NSF Grants EIA-0113539 ITR/SY+PE: “Improving the Education of Software Testers” and CCLI-0717613 “Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing.” Any opinions, findings and conclusions or recommendations expressed in this workshop are those of the presenter(s) and do not necessarily reflect the views of the National Science Foundation.

Four more presentations

Sunday, March 30th, 2008

“Adapting Academic Course Materials in Software Testing for Industrial Professional Development.” [SLIDES] Colloquium, Florida Institute of Technology, March 2008

The Association for Software Testing and I have been adapting the BBST course for online professional development. This presentation updates my students and colleagues at work on what we’re doing to transfer fairly rigorous academic course materials and teaching methods to a practitioner audience.

These next three are reworkings of presentations I’ve given a few times before:

“Software testing as a social science,” [SLIDES] STEP 2000 Workshop on Software Testing, Memphis, May 2008.

Social sciences study humans, especially humans in society. The social scientist’s core question, for any new product or technology is, “What will be the impact of X on people?” Social scientists normally deal with ambiguous issues, partial answers, situationally specific results, diverse interpretations and values–and they often use qualitative research methods. If we think about software testing in terms of the objectives (why we test) and the challenges (what makes testing difficult) rather than the methods and processes, then I think testing is more like a social science than like programming or manufacturing quality control. As with all social sciences, tools are important. But tools are what we use, not why we use them.

“The ongoing revolution in software testing,” [SLIDES] October 2007

My intent in this talk is to challenge an orthodoxy in testing, a set of commonly accepted assumptions about our mission, skills, and constraints, including plenty that seemed good to me when I published them in 1988, 1993 or 2001. Surprisingly, some of the old notions lost popularity in the 1990’s but came back under new marketing with the rise of eXtreme Programming.

I propose we embrace the idea that testing is an active, skilled technical investigation. Competent testers are investigators–clever, sometimes mischievous researchers–active learners who dig up information about a product or process just as that information is needed.

I think that

  • views of testing that don’t portray testing this way are obsolete and counterproductive for most contexts and
  • educational resources for testing that don’t foster these skills and activities are misdirected and misleading.

“Software-related measurement: Risks and opportunities,” [SLIDES] October 2007

I’ve seen published claims that only 5% of software companies have metrics programs. Why so low? Are we just undisciplined and lazy? Most managers I know have tried at least one measurement program–and abandoned it because so many programs do more harm than good, at a high cost. This session has three parts:

  1. Measurement theory and how it applies to software development metrics (which, at their core, are typically human performance measures).
  2. A couple of examples of qualitative measurements that can drive useful behavior.
  3. (Consideration of client’s particular context–deleted.)

Writing Multiple Choice Test Questions

Wednesday, October 24th, 2007


This is a tutorial on creating multiple choice questions, framed by Haladyna’s heuristics for test design and Anderson & Krathwohl’s update to Bloom’s taxonomy. My interest in computer-gradable test questions is to support teaching and learning rather than high-stakes examination. Some of the design heuristics are probably different for this case. For example, which is the more desirable attribute for a test question:

  a. defensibility (you can defend its fairness and appropriateness to a critic) or
  b. potential to help a student gain insight?

In high-stakes exams, (a) [defensibility] is clearly more important, but as a support for learning, I’d rather have (b) [support for insight].

This tutorial’s examples are from software engineering, but from my perspective as someone who has also taught psychology and law, I think the ideas are applicable across many disciplines.

The tutorial’s advice and examples specifically target three projects:



1. Consider a question with the following structure:

Choose the answer:

  a. First option
  b. Second option

The typical way we will present this question is:

Choose the answer:

  a. First option
  b. Second option
  c. Both (a) and (b)
  d. Neither (a) nor (b)

  • If the correct answer is (c) then the examinee will receive 25% credit for selecting only (a) or only (b).

2. Consider a question with the following structure:

Choose the answer:

  a. First option
  b. Second option
  c. Third option

The typical way we will present this question is:

Choose the answer:

  a. First option
  b. Second option
  c. Third option
  d. (a) and (b)
  e. (a) and (c)
  f. (b) and (c)
  g. (a) and (b) and (c)

  • If the correct answer is (d), the examinee will receive 25% credit for selecting only (a) or only (b). Similarly for (e) and (f).
  • If the correct answer is (g) (all of the above), the examinee will receive 25% credit for selecting (d) or (e) or (f) but nothing for the other choices.

3. Consider a question with the following structure:

Choose the answer:

  a. First option
  b. Second option
  c. Third option
  d. Fourth option

The typical ways we might present this question are:

Choose the answer:

  a. First option
  b. Second option
  c. Third option
  d. Fourth option


Choose the answer:

  a. First option
  b. Second option
  c. Third option
  d. Fourth option
  e. (a) and (c)
  f. (a) and (b) and (d)
  g. (a) and (b) and (c) and (d)

There will be a maximum of 7 choices.

The three combination choices can be any combination of two, three or four of the first four answers.

  • If the correct answer is a pair, like (e), the examinee will receive 25% credit for selecting only one of the two correct options and nothing for selecting a combination that includes both correct options but also includes an incorrect choice.
  • If the correct answer is (f) (three of the four), the examinee will receive 25% credit for selecting a correct pair (if (a) and (b) and (d) are all correct, then any two of them get 25%) but nothing for selecting only one of the three or selecting a choice that includes two or three correct but also includes an incorrect choice.
  • If the correct answer is (g) (all correct), the examinee will receive a 25% credit for selecting a correct triple.
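
Read together, the partial-credit rules above appear to follow one pattern: full credit for the exact answer, 25% for a selection that is exactly one option short of a correct combination, and nothing otherwise. Here is a minimal Python sketch under that reading. The function name `partial_credit` and the representation of each choice as a set of simple-option letters are assumptions made for illustration; this is not the courses' actual grading code.

```python
def partial_credit(selected, correct):
    """Score one answer under the partial-credit pattern described above.

    Both arguments are collections of simple-option letters. For example,
    the choice "(a) and (b)" is "ab", and "Neither (a) nor (b)" is "".
    """
    selected, correct = frozenset(selected), frozenset(correct)
    if selected == correct:
        return 1.0  # exact match: full credit
    one_short = (
        len(correct) >= 2                        # only combinations earn partial credit
        and selected < correct                   # nothing incorrect was selected
        and len(selected) == len(correct) - 1    # exactly one correct option is missing
    )
    return 0.25 if one_short else 0.0


# Correct answer is "(a) and (b) and (d)", as in the third structure.
for choice in ["abd", "ab", "a", "abcd"]:
    print(choice, partial_credit(choice, "abd"))
```

With "(a) and (b) and (d)" as the key, a correct pair earns 0.25, while a single option or a choice that adds an incorrect option earns 0.0.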



Here are a few terms commonly used when discussing the design of multiple choice questions. See the Reference Examples, below.

  • Test: In this article, the word “test” is ambiguous. Sometimes we mean a software test (an experiment that can expose problems in a computer program) and sometimes an academic test (a question that can expose problems in someone’s knowledge). In these definitions, “test” means “academic test.”

  • Test item: a test item is a single test question. It might be a multiple choice test question or an essay test question (or whatever).
  • Content item: a content item is a single piece of content, such as a fact or a rule, something you can test on.
  • Stem: The opening part of the question is called the stem. For example, “Which is the best definition of the testing strategy in a testing project?” is Reference Example B’s stem.
  • Distractor: An incorrect answer. In Reference Example B, (b) and (c) are distractors.
  • Correct choice: The correct answer for Reference Example B is (a) “The plan for applying resources and selecting techniques to achieve the testing mission.”
  • The Question format: The stem is a complete sentence and asks a question that is answered by the correct choice and the distractors. Reference Example A has this format.
  • The Best Answer format: The stem asks a complete question. Most or all of the distractors and the correct choice are correct to some degree, but one of them is stronger than the others. In Reference Example B, all three answers are plausible but in the BBST course, given the BBST lectures, (a) is the best.
  • The Incomplete Stem format: The stem is an incomplete sentence that the correct choice and distractors complete. Reference Example C has this format.
  • Complex formats: In a complex-format question, the alternatives include simple answers and combinations of these answers. In Reference Example A, the examinee can choose (a) “We can never be certain that the program is bug free” or (d) which says that both (a) and (b) are true or (g) which says that all of the simple answers (a, b and c) are true.
  • Learning unit: A learning unit typically includes a limited set of content that shares a common theme or purpose, plus learning support materials such as a study guide, test items, an explicit set of learning objectives, a lesson plan, readings, lecture notes or video, etc.
  • High-stakes test: A test is high-stakes if there are significant benefits for passing the test or significant costs of failing it.

The Reference Examples

For each of the following, choose one answer.

A. What are some important consequences of the impossibility of complete testing?

  a. We can never be certain that the program is bug free.
  b. We have no definite stopping point for testing, which makes it easier for some managers to argue for very little testing.
  c. We have no easy answer for what testing tasks should always be required, because every task takes time that could be spent on other high importance tasks.
  d. (a) and (b)
  e. (a) and (c)
  f. (b) and (c)
  g. All of the above

B. Which is the best definition of the testing strategy in a testing project?

  a. The plan for applying resources and selecting techniques to achieve the testing mission.
  b. The plan for applying resources and selecting techniques to assure quality.
  c. The guiding plan for finding bugs.

C. Complete statement coverage means …

  a. That you have tested every statement in the program.
  b. That you have tested every statement and every branch in the program.
  c. That you have tested every IF statement in the program.
  d. That you have tested every combination of values of IF statements in the program.

D. The key difference between black box testing and behavioral testing is that:

  a. The test designer can use knowledge of the program’s internals to develop a black box test, but cannot use that knowledge in the design of a behavioral test because the behavioral test is concerned with behavior, not internals.
  b. The test designer can use knowledge of the program’s internals to develop a behavioral test, but cannot use that knowledge in the design of a black box test because the designer cannot rely on knowledge of the internals of the black box (the program).
  c. The behavioral test is focused on program behavior whereas the black box test is concerned with system capability.
  d. (a) and (b)
  e. (a) and (c)
  f. (b) and (c)
  g. (a) and (b) and (c)

E. What is the significance of the difference between black box and glass box tests?

  a. Black box tests cannot be as powerful as glass box tests because the tester doesn’t know what issues in the code to look for.
  b. Black box tests are typically better suited to measure the software against the expectations of the user, whereas glass box tests measure the program against the expectations of the programmer who wrote it.
  c. Glass box tests focus on the internals of the program whereas black box tests focus on the externally visible behavior.


Several papers on the web organize their discussion of multiple choice tests around a researched set of advice from Haladyna, Downing & Rodriguez or the updated list from Haladyna (2004). I’ll do that too, tying their advice back to our needs for software testing.

Content Guidelines

  1. Every item should reflect specific content and a single specific cognitive process, as called for in the test specifications (table of specifications, two-way grid, test blueprint).
  2. Base each item on important content to learn; avoid trivial content.
  3. Use novel material to measure understanding and the application of knowledge and skills.
  4. Keep the content of an item independent from content of other items on the test.
  5. Avoid overspecific and overgeneral content.
  6. Avoid opinion-based items.
  7. Avoid trick items.
  8. Format items vertically instead of horizontally.

Style and Format Concerns

  9. Edit items for clarity.
  10. Edit items for correct grammar, punctuation, capitalization and spelling.
  11. Simplify vocabulary so that reading comprehension does not interfere with testing the content intended.
  12. Minimize reading time. Avoid excessive verbiage.
  13. Proofread each item.

Writing the Stem

  14. Make the directions as clear as possible.
  15. Make the stem as brief as possible.
  16. Place the main idea of the item in the stem, not in the choices.
  17. Avoid irrelevant information (window dressing).
  18. Avoid negative words in the stem.

Writing Options

  19. Develop as many effective options as you can, but two or three may be sufficient.
  20. Vary the location of the right answer according to the number of options. Assign the position of the right answer randomly.
  21. Place options in logical or numerical order.
  22. Keep options independent; choices should not be overlapping.
  23. Keep the options homogeneous in content and grammatical structure.
  24. Keep the length of options about the same.
  25. “None of the above” should be used sparingly.
  26. Avoid using “all of the above.”
  27. Avoid negative words such as not or except.
  28. Avoid options that give clues to the right answer.
  29. Make all distractors plausible.
  30. Use typical errors of students when you write distractors.
  31. Use humor if it is compatible with the teacher; avoid humor in a high-stakes test.

Now to apply those to our situation.


1. Every item should reflect specific content and a single specific cognitive process, as called for in the test specifications (table of specifications, two-way grid, test blueprint).

Here are the learning objectives from the AST Foundations course. Note the grid (the table), which lists the level of knowledge and skills in the course content and defines the level of knowledge we hope the learner will achieve. For discussions of level of knowledge, see my blog entries on Bloom’s taxonomy [1] [2] [3]:

Learning Objectives of the AST Foundations Course | Anderson / Krathwohl level
1. Familiar with basic terminology and how it will be used in the BBST courses | Understand
2. Aware of honest and rational controversy over definitions of common concepts and terms in the field | Understand
3. Understand there are legitimately different missions for a testing effort. Understand the argument that selection of mission depends on contextual factors. Able to evaluate relatively simple situations that exhibit strongly different contexts in terms of their implication for testing strategies. | Understand, Simple evaluation
4. Understand the concept of oracles well enough to apply multiple oracle heuristics to their own work and explain what they are doing and why | Understand and apply
5. Understand that complete testing is impossible. Improve ability to estimate and explain the size of a testing problem. | Understand, rudimentary application
6. Familiarize students with the concept of measurement dysfunction | Understand
7. Improve students’ ability to adjust their focus from narrow technical problems (such as analysis of a single function or parameter) through broader, context-rich problems | Analyze
8. Improve online study skills, such as learning more from video lectures and associated readings | Apply
9. Improve online course participation skills, including online discussion and working together online in groups | Apply
10. Increase student comfort with formative assessment (assessment done to help students take their own inventory, think and learn rather than to pass or fail the students) | Apply

For each of these objectives, we could list the items that we want students to learn. For example:

  • list the terms that students should be able to define
  • list the divergent definitions that students should be aware of
  • list the online course participation skills that students should develop or improve.

We could create multiple choice tests for some of these:

  • We could check whether students could recognize a term’s definition.
  • We could check whether students could recognize some aspect of an online study skill.

But there are elements in the list that aren’t easy to assess with a multiple choice test. For example, how can you tell whether someone works well with other students by asking them multiple choice questions? To assess that, you should watch how they work in groups, not read multiple-choice answers.

Now, back to Haladyna’s first guideline:

  • Use an appropriate type of test for each content item. Multiple choice is good for some, but not all.
  • If you use a multiple choice test, each test item (each question) should focus on a single content item. That might be a complex item, such as a rule or a relationship or a model, but it should be something that you and the student would consider to be one thing. A question spread across multiple issues is confusing in ways that have little to do with the content being tested.
  • Design the test item to assess the material at the right level (see the grid, above). For example, if you are trying to learn whether someone can use a model to evaluate a situation, you should ask a question that requires the examinee to apply the model, not one that just asks whether she can remember the model.
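
One way to keep a set of questions aligned with the grid is to tag each item with the objective it targets and the level it actually exercises, then audit the tags. This is only a rough Python sketch: the item IDs, the field names, and the `audit` helper are hypothetical, and the blueprint holds just a few rows of the grid above, with each level description simplified to a set of level names.

```python
# A few rows of the grid: objective number -> acceptable Anderson/Krathwohl levels.
# (Simplified for illustration; e.g. "Understand and apply" becomes two level names.)
BLUEPRINT = {
    1: {"Understand"},
    4: {"Understand", "Apply"},
    5: {"Understand", "Apply"},
    7: {"Analyze"},
}

# Hypothetical test items, each tagged with one objective and one level.
ITEMS = [
    {"id": "Q1", "objective": 1, "level": "Understand"},
    {"id": "Q2", "objective": 4, "level": "Remember"},  # only asks for recall
    {"id": "Q3", "objective": 5, "level": "Apply"},
]


def audit(items, blueprint):
    """Return (mismatched item ids, objectives that no item assesses)."""
    mismatched = [
        item["id"]
        for item in items
        if item["level"] not in blueprint.get(item["objective"], set())
    ]
    covered = {item["objective"] for item in items}
    uncovered = sorted(set(blueprint) - covered)
    return mismatched, uncovered


print(audit(ITEMS, BLUEPRINT))
```

Here Q2 is flagged because it tests recall where the grid calls for understanding and application, and objective 7 is reported as uncovered, which may simply mean it needs a different kind of assessment than multiple choice.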

When we work with a self-contained learning unit, such as the individual AST BBST courses and the engineering ethics units, it should be possible to list most of the items that students should learn and the associated cognitive level.

However, for the Open Certification exam, the listing task is much more difficult because it is fair game to ask about any of the field’s definitions, facts, concepts, models, skills, etc. None of the “Body of Knowledge” lists are complete, but we might use them as a start for brainstorming about what would be useful questions for the exam.

The Open Certification (OC) exam is different from other high-stakes exams because the OC question database serves as a study guide. Questions that might be too hard in a surprise-test (a test with questions you’ve never seen before) might be instructive in a test database that prepares you for an exam derived from the database questions–especially when the test database includes discussion of the questions and answers, not just the barebones questions themselves.

2. Base each item on important content to learn; avoid trivial content.

The heuristic for Open Certification is: Don’t ask the question unless you think a hiring manager would actually care whether this person knew the answer to it.

3. Use novel material to measure understanding and the application of knowledge and skills.

That is, reword the idea you are asking about rather than using the same words as the lecture or assigned readings. This is important advice for a traditional surprise test because people are good matchers:

  • If I show you exactly the same thing that you saw before, you might recognize it as familiar even if you don’t know what it means.
  • If I want to be a nasty trickster, I can put exact-match (but irrelevant) text in a distractor. You’ll be more likely to guess this answer (if you’re not sure of the correct answer) because this one is familiar.

This is important advice for BBST because the student can match the words to the readings (in this open book test) without understanding them. In the open book exam, this doesn’t even require recall.

On the other hand, especially in the open book exams, I like to put exact matches in the stem. The stem is asking a question like, What does this mean? or What can you do with this? If you use textbook phrases to identify the this, then you are helping the student figure out where to look for possible answers. In the open book exam, the multiple choice test is a study aid. It is helpful to orient the student to something you want him to think about and read further about.

4. Keep the content of an item independent from content of other items on the test.

Suppose that you define a term in one question and then ask how to apply the concept in the next. The student who doesn’t remember the definition will probably be able to figure it out after reading the next question (the application).

It’s a common mistake to write an exam that builds forward without realizing that the student can read the questions and answer them in any order.

5. Avoid overspecific and overgeneral content.

The concern with questions that are overly specific is that they are usually trivial. Does it really matter what year Boris Beizer wrote his famous Software Testing Techniques? Isn’t it more important to know what techniques he was writing about and why?

There are some simple facts that we might expect all testers to know.

For example, what’s the largest ASCII code in the lower ASCII character set, and what character does it signify?

The boundary cases for ASCII might be core testing knowledge, and thus fair game.
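For reference, the fact behind that sample question can be checked in a few lines. This is only a minimal Python sketch; it assumes that "lower ASCII" means the 7-bit range (codes 0 through 127), and the variable names are illustrative.

```python
# Assumption: "lower ASCII" is the 7-bit range, codes 0 through 127.
LOWER_ASCII_MAX = 2**7 - 1

# Code 127 is the DEL control character, which is not printable.
largest = chr(LOWER_ASCII_MAX)

# One below the boundary is the last printable character, the tilde.
largest_printable = chr(LOWER_ASCII_MAX - 1)

print(LOWER_ASCII_MAX, repr(largest), largest.isprintable())
print(LOWER_ASCII_MAX - 1, repr(largest_printable))
```

The pair of values on either side of the printable/control boundary (126 and 127) is the kind of boundary case a tester might be expected to know.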

However, in most cases, facts are easy to look up in books or with an electronic search. Before asking for a memorized fact, ask why you would care whether the tester had memorized that fact or not.

The concern with questions that are overly general is that they are also usually trivial–or wrong–or both.

6. Avoid opinion-based items.

This is obvious, right? A question is unfair if it asks for an answer that some experts would consider correct and rejects an answer that other experts would consider correct.

But we have this problem in testing.

There are several mutually exclusive definitions of “test case.” There are strong professional differences about the value of a test script or the utility of the V-model or even whether the V-model was implicit in the waterfall model (read the early papers) or a more recent innovation.

Most of the interesting definitions in our field convey opinions, and the Standards that assert the supposedly-correct definitions get that way by ignoring the controversies.

What tactics can we use to deal with this?

a. The qualified opinion.

For example, consider this question:

“The definition of exploratory testing is…”

and this answer:

“a style of software testing that emphasizes the personal freedom and responsibility of the individual tester to continually optimize the value of her work by treating test-related learning, test design, test execution, and test result interpretation as mutually supportive activities that run in parallel throughout the project.”

Is the answer correct or not?

Some people think that exploratory testing is bound tightly to test execution; they would reject the definition.

On the other hand, if we changed the question to,

“According to Cem Kaner, the definition of exploratory testing is…”

that long definition would be the right answer.

Qualification is easy in the BBST course because you can use the qualifier, According to the lecture. This is what the student is studying right now and the exam is open book, so the student can check the fact easily.

Qualification is more problematic for closed-book exams like the certification exam. In this general case, can we fairly expect students to know who prefers which definition?

The problem is that qualified opinions contain an often-trivial fact. Should we really expect students or certification-examinees to remember definitions in terms of who said what? Most of the time, I don’t think so.

b. Drawing implications

For example, consider asking a question in one of these ways:

  • If A means X, then if you do A, you should expect the following results.
  • Imagine two definitions of A: X and Y. Which bugs would you be more likely to expose if you followed X in your testing and which if you followed Y?
  • Which definition of X is most consistent with theory Y?

7. Avoid trick items.

Haladyna (2004, p. 104) reports work by Roberts that identified several types of (intentional or unintentional) tricks in questions:

    1. The item writer’s intention appeared to deceive, confuse, or mislead test takers.
    2. Trivial content was represented (which violates one of our item-writing guidelines)
    3. The discrimination among options was too fine.
    4. Items had window dressing that was irrelevant to the problem.
    5. Multiple correct answers were possible.
    6. Principles were presented in ways that were not learned, thus deceiving students.
    7. Items were so highly ambiguous that even the best students had no idea about the right answer.

Some other tricks that undermine accurate assessment:

    1. Put text in a distractor that is irrelevant to the question but exactly matches something from the assigned readings or the lecture.
    2. Use complex logic (such as not (A and B) or a double negative) — unless the learning being tested involves complex logic.
    3. Accurately qualify a widely discredited view: According to famous-person, the definition of X is Y, where Y is a definition no one accepts any more, but famous-person did in fact publish it.
    4. In the set of items for a question, leave grammatical errors in all but the second-best choice. (Many people will guess that the grammatically-correct answer is the one intended to be graded as correct.)

Items that require careful reading are not necessarily trick items. This varies from field to field. For example, my experience with exams for lawyers and law students is that they often require very precise reading. Testers are supposed to be able to do very fine-grained specification analysis.

Consider Example D:

D. The key difference between black box testing and behavioral testing is that:

The options include several differences that students find plausible. Every time I give this question, some students choose a combination answer (such as (a) and (b)). This is a mistake, because the question calls for “The key difference,” and that cannot be a collection of two or more differences.

Consider Example E:

E. What is the significance of the difference between black box and glass box tests?

A very common mistake is to choose this answer:

Glass box tests focus on the internals of the program whereas black box tests focus on the externally visible behavior.

The answer is an accurate description of the difference, but it says nothing about the significance of the difference. Why would someone care about the difference? What is the consequence of the difference?

Over time, students learn to read questions like this more carefully. My underlying assumption is that they are also learning or applying, in the course of this, skills they need to read technical documents more carefully. Those are important skills for both software testing and legal analysis and so they are relevant to the courses that are motivating this tutorial. However, for other courses, questions like these might be less suitable.

On a high-stakes exam, with students who had not had a lot of exam-preparation training, I would not ask these questions because I would not expect students to be prepared for them. On the high-stakes exam, the ambiguity of a wrong answer (might not know the content vs. might not have parsed the question carefully) could lead to the wrong conclusion about the student’s understanding of the material.

In contrast, in an instructional context in which we are trying to teach students to parse what they read with care, there is value in subjecting students to low-risk reminders to read with care.


8. Format items vertically instead of horizontally.

If the options are brief, you could format them as a list of items, one beside the next. However, these lists are often harder to read and it is much harder to keep formatting consistent across a series of questions.

9. Edit items for clarity.

I improve the clarity of my test items in several ways:

  • I ask colleagues to review the items.
  • I coteach with other instructors or with teaching assistants. They take the test and discuss the items with me.
  • I encourage students to comment on test items. I use course management systems, so it is easy to set up a question-discussion forum for students to query, challenge or complain about test items.

In my experience, it is remarkable how many times an item can go through review (and improvement) and still be confusing.

10. Edit items for correct grammar, punctuation, capitalization and spelling.

It is common for instructors to write the stem and the correct choice together when they first write the question. The instructor words the distractors later, often less carefully and in some way that is inconsistent with the correct choice. These differences become undesirable clues about the right and wrong choices.

11. Simplify vocabulary so that reading comprehension does not interfere with testing the content intended.

There’s not much point asking a question that the examinee doesn’t understand. If the examinee doesn’t understand the technical terms (the words or concepts being tested), that’s one thing. But if the examinee doesn’t understand the other terms, the question simply won’t reach the examinee’s knowledge.

12. Minimize reading time. Avoid excessive verbiage.

Students whose first language is not English often have trouble with long questions.

13. Proofread each item.

Despite editorial care, remarkably many simple mistakes survive review or are introduced by mechanical error (e.g. cutting and pasting from a master list to the test itself).


14. Make the directions as clear as possible.

Consider the following confusingly-written question:

A program will accept a string of letters and digits into a password field. After it accepts the string, it asks for a comparison string, and on accepting a new input from the customer, it compares the first string against the second and rejects the password entry if the strings do not match.

  1. There are 218340105584896 possible tests of 8-character passwords.
  2. This method of password verification is subject to the risk of input-buffer overflow from an excessively long password entry
  3. This specification is seriously ambiguous because it doesn’t tell us whether the program accepts or rejects/filters non-alphanumeric characters into the second password entry

Let us pretend that each of these answers could be correct. Which is correct for this question? Is the stem calling for an analysis of the number of possible tests, the risks of the method, the quality of the specification, or something else?

The stem should make clear whether the question is looking for the best single answer or potentially more than one, and whether the question is asking for facts, opinion, examples, reasoning, a calculation, or something else.

The reader should never have to read the set of possible answers to understand what the question is asking.
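As a side check on option 1 in the example above, the figure can be reproduced under two assumptions that the stem itself never states: that "letters and digits" means the 26 uppercase letters, 26 lowercase letters and 10 digits, and that a "test" is one distinct string of exactly 8 characters. A minimal Python sketch (the function name is illustrative):

```python
import string

# Assumption: "letters and digits" means A-Z, a-z and 0-9 (62 symbols).
alphabet = string.ascii_letters + string.digits

def count_strings(alphabet_size, length):
    """Number of distinct strings of exactly `length` characters."""
    return alphabet_size ** length

total = count_strings(len(alphabet), 8)
print(total)  # 218340105584896
```

That the reader has to guess at these assumptions is one more sign that the stem does not say what it is asking.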

15. Make the stem as brief as possible.

This is part of the same recommendation as Heuristic #12 above. If the entire question should be as short as possible (#12), the stem should be as short as possible.

However, “as short as possible” does not necessarily mean “short.”

Here are some examples:

  • The stem describes some aspect of the program in enough detail that it is possible to compute the number of possible software test cases. The choices include the correct answer and three miscalculations.
  • The stem describes a software development project in enough detail that the reader can see the possibility of doing a variety of tasks and the benefits they might offer to the project, and then asks the reader to prioritize some of the tasks. The choices are of the form, “X is more urgent than Y.”
  • The stem describes a potential error in the code, the types of visible symptoms that this error could cause, and then calls for selection of the best test technique for exposing this type of bug.
  • The stem quotes part of a product specification and then asks the reader to identify an ambiguity or to identify the most serious impact on test design an ambiguity like this might cause.
  • The stem describes a test, a failure exposed by the test, a stakeholder (who has certain concerns) who receives failure reports and is involved in decisions about the budget for the testing effort, and asks which description of the failure would be most likely to be perceived as significant by that stakeholder. An even more interesting question (faced frequently by testers in the real world) is which description would be perceived as significant (credible, worth reading and worth fixing) by Stakeholder 1 and which other description would be more persuasive for Stakeholder 2. (Someone concerned with next month’s sales might assess risk very differently from someone concerned with engineering / maintenance cost of a product line over a 5-year period. Both concerns are valid, but a good tester might raise different consequences of the same bug for the marketer than for the maintenance manager).

Another trend for writing test questions that address higher-level learning is to write a very long and detailed stem followed by several multiple choice questions based on the same scenario.

Long questions like these are fair game (normal cases) in exams for lawyers, such as the Multistate Bar Exam. They are looked on with less favor in disciplines that don’t demand the same level of skill in quickly reading/understanding complex blocks of text. Therefore, for many engineering exams (for example), questions like these are probably less popular.

  • They discriminate against people whose first language is not English and who are therefore slower readers of complex English text, or more generally against anyone who is a slow reader, because the exam is time-pressed.
  • They discriminate against people who understand the underlying material and who can reach an application of that material to real-life-complexity circumstances if they can work with a genuine situation or a realistic model (something they can appreciate in a hands-on way) but who are not so good at working from hypotheticals that abstract out all information that the examiner considers inessential.
  • They can cause a cascading failure. If the exam includes 10 questions based on one hypothetical and the examinee misunderstands that one hypothetical, she might blow all 10 questions.
  • They can demoralize an examinee who lacks confidence/skill with this type of question, resulting in a bad score because the examinee stops trying to do well on the test.

However, in a low-stakes exam without time limits, those concerns are less important. The exam becomes practice for this type of analysis, rather than punishment for not being good at it.

In software testing, we are constantly trying to simplify a complex product into testable lines of attack. We ignore most aspects of the product and design tests for a few aspects, considered on their own or in combination with each other. We build explicit or implicit mental models of the product under test, and work from those to the tests, and from the tests back to the models (to help us decide what the results should be). Therefore, drawing out the implications of a complex system is a survival skill for testers and questions of this style are entirely fair game–in a low stakes exam, designed to help the student learn, rather than a high-stakes exam designed to create consequences based on an estimate of what the student knows.

16. Place the main idea of the item in the stem, not in the choices.

Some instructors adopt an intentional style in which the stem is extremely short and the question is largely defined in the choices.

The confusingly-written question in Heuristic #14 was an example of a case in which the reader can’t tell what the question is asking until he reads the choices. In #14, there were two problems:

  • the stem didn’t state what question it was asking
  • the choices themselves were fundamentally different, asking about different dimensions of the situation described in the stem rather than exploring one dimension with a correct answer and distracting mistakes. The reader had to guess / decide which dimension was of interest as well as deciding which answer might be correct.

Suppose we fix the second problem but still have a stem so short that you don’t know what the question is asking for until you read the options. That’s the issue addressed here (Heuristic #16).

For example, here is a better-written question that doesn’t pass muster under Heuristic #16:

A software oracle:

  1. is defined this way
  2. is defined this other way
  3. is defined this other way

The better question under this heuristic would be:

What is the definition of a software oracle?

  1. this definition
  2. this other definition
  3. this other other definition

As long as the options are strictly parallel (they are alternative answers to the same implied question), I don’t think this is a serious problem.

17. Avoid irrelevant information (window dressing).

Imagine a question that includes several types of information in its description of some aspect of a computer program:

  • details about how the program was written
  • details about how the program will be used
  • details about the stakeholders who are funding or authorizing the project
  • details about ways in which products like this have failed before

Any of these details might be relevant to some question, but most of them are probably not relevant to any particular question. For example, calculating the theoretically possible number of tests of part of the program doesn’t require any knowledge of the stakeholders.

Information:

  • is irrelevant if you don’t need it to determine which option is the correct answer
  • unless the reader’s ability to wade through irrelevant information of this type in order to get to the right underlying formula (or generally, the right approach to the problem) is part of what you are testing.

18. Avoid negative words in the stem.

Here are some examples of stems with negative structure:

  • Which of the following is NOT a common definition of software testing?
  • Do NOT assign a priority to a bug report EXCEPT under what condition(s)?
  • You should generally compute code coverage statistics UNLESS:

For many people, these are harder than questions that ask for the same information in a positively-phrased way.

There is some evidence that there are cross-cultural variations. That is, these questions are harder for some people than others because (probably) of their original language training in childhood. Therefore, a bad result on this question might have more to do with the person’s heritage than with their knowledge or skill in software testing.

However, the ability to parse complex logical expressions is an important skill for a tester. Programmers make lots of bugs when they write code to implement things like:


So testers have to be able to design tests that anticipate the bug and check whether the programmer made it.

It is not unfair to ask a tester to handle some complex negation, if your intent is to test whether the tester can work with complex logical expressions. But if you think you are testing something else, and your question demands careful logic processing, you won’t know from a bad answer whether the problem was the content you thought you were testing or the logic that you didn’t consider.
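To illustrate the kind of bug meant here, consider a hypothetical rule (invented for this sketch, not taken from any real program): reject an order unless the customer is both verified and in good standing. A common slip is to distribute the negation incorrectly. The Python sketch below enumerates every input combination to find the cases that expose the difference.

```python
from itertools import product

def should_reject(verified, good_standing):
    """Intended rule: not (A and B)."""
    return not (verified and good_standing)

def should_reject_buggy(verified, good_standing):
    """Typical mistake: (not A) and (not B) instead of (not A) or (not B)."""
    return (not verified) and (not good_standing)

# Compare the two versions over all four input combinations.
revealing_cases = [
    (a, b)
    for a, b in product([False, True], repeat=2)
    if should_reject(a, b) != should_reject_buggy(a, b)
]
print(revealing_cases)  # [(False, True), (True, False)]
```

Only the mixed cases reveal the bug. A tester who tries just "both true" and "both false" would see the buggy version pass.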

Another problem is that many people read negative sentences as positive. Their eyes glaze over when they see the NOT and they answer the question as if it were positive (Which of the following IS a common definition of software testing?). Unless you are testing for glazy eyes, you should make the negation as visible as possible. I use ITALICIZED ALL-CAPS BOLDFACE in the examples above.


19. Develop as many effective options as you can, but two or three may be sufficient.

Imagine an exam with 100 questions. All of them have two options. Someone who is randomly guessing should get 50% correct.

Now imagine an exam with 100 questions that all have four options. Under random guessing, the examinee should get 25%.

The issue of effectiveness is important because an answer that is not credible (not effective) won’t gain any guesses. For example, imagine that you saw this question on a quiz in a software testing course:

Green-box testing is:

  1. common at box manufacturers when they start preparing for the Energy Star rating
  2. a rarely-taught style of software testing
  3. a nickname used by automobile manufacturers for tests of hybrid cars
  4. the name of Glen Myers’ favorite book

I suspect that most students would pick choice 2 because 1 and 3 are irrelevant to the course and 4 is ridiculous (if it was a proper name, for example, “Green-box testing” would be capitalized.) So even though there appear to be 4 choices, there is really only 1 effective one.

The number of choices is important, as is the correction-for-guessing penalty, if you are using multiple choice test results to assign a grade or assess the student’s knowledge in a way that carries consequences for the student.
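The arithmetic behind these points can be sketched briefly. The functions below are illustrative and not part of any scoring package. The correction shown is the classic formula score (right answers minus wrong answers divided by one less than the number of options), which is assumed here to be the kind of penalty meant.

```python
def expected_guess_rate(effective_options):
    """Chance of a correct guess when only plausible options draw guesses."""
    return 1 / effective_options

def corrected_score(right, wrong, options):
    """Classic correction for guessing: R - W / (k - 1)."""
    return right - wrong / (options - 1)

print(expected_guess_rate(2))  # 0.5
print(expected_guess_rate(4))  # 0.25
print(expected_guess_rate(1))  # 1.0, as in the green-box example

# A pure guesser on 100 four-option items expects 25 right and 75 wrong,
# which the correction brings back to zero.
print(corrected_score(25, 75, 4))  # 0.0
```

The correction assumes every option is equally likely to attract a guess, which is why ineffective options undermine it.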

The number of choices, like the final score, is much less important if the quiz is for learning support rather than for assessment.

The Open Certification exam is for assessment and has a final score, but it is different from other exams in that examinees can review the questions and consider the answers in advance. Statistical theories of scoring just don’t apply well under those conditions.

20. Vary the location of the right answer according to the number of options. Assign the position of the right answer randomly.

There’s an old rule of thumb–if you don’t know the answer, choose the second one in the list. Some inexperienced exam-writers tend to put the correct answer in the same location more often than if they varied location randomly. Experienced exam-writers use a randomization method to eliminate this bias.
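One possible randomization method, sketched in Python. The function name and the sample option texts are hypothetical, and the sketch assumes that the option texts within a question are distinct.

```python
import random

def shuffle_options(correct, distractors, rng):
    """Return the shuffled options and the index of the correct one."""
    options = [correct] + list(distractors)
    rng.shuffle(options)
    return options, options.index(correct)

rng = random.Random(2009)  # fixed seed only so the sketch is repeatable
positions = [0, 0, 0, 0]
for _ in range(1000):
    _, index = shuffle_options("right", ["wrong 1", "wrong 2", "wrong 3"], rng)
    positions[index] += 1

# Each position should hold the right answer roughly a quarter of the time.
print(positions)
```

Tallying the positions across a finished exam, as the loop does here, is a quick way to check that no slot is favored.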

21. Place options in logical or numerical order.

The example that Haladyna gives is numeric. If you’re going to ask the examinee to choose the right number from a list of choices, then present them in order (like $5, $10, $20, $175) rather than randomly (like $20, $5, $175, $10).

In general, the idea underlying this heuristic is that the reader is less likely to make an accidental error (one unrelated to their knowledge of the subject under test) if the choices are ordered and formatted in the way that makes them as easy as possible to read quickly and understand correctly.
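If the options are kept in a file or question bank, note that ordering them as text does not give numerical order. A small Python sketch, with a hypothetical helper name:

```python
def by_amount(option):
    """Sort key: the numeric value of a label such as '$175'."""
    return float(option.lstrip("$").replace(",", ""))

options = ["$20", "$5", "$175", "$10"]

print(sorted(options))                 # ['$10', '$175', '$20', '$5']
print(sorted(options, key=by_amount))  # ['$5', '$10', '$20', '$175']
```

The first result is what a plain alphabetical sort produces. Only the second matches the order a reader expects.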

22. Keep options independent; choices should not be overlapping.

Assuming standard productivity metrics, how long should it take to create and document 100 boundary tests of simple input fields?

  1. 1 hour or less
  2. 5 hours or less
  3. between 3 and 7 hours
  4. more than 6 hours

These choices overlap. If you think the correct answer is 4 hours, which one do you pick as the correct answer?
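The overlap can be made concrete by checking which options a given value satisfies. This is a sketch only. It treats "between 3 and 7 hours" as inclusive at both ends, which is an assumption.

```python
# Each option expressed as a rule over a number of hours.
options = {
    "1 hour or less": lambda h: h <= 1,
    "5 hours or less": lambda h: h <= 5,
    "between 3 and 7 hours": lambda h: 3 <= h <= 7,  # assumed inclusive
    "more than 6 hours": lambda h: h > 6,
}

def matching_options(hours):
    """Return every option that the given value satisfies."""
    return [label for label, accepts in options.items() if accepts(hours)]

print(matching_options(4))    # ['5 hours or less', 'between 3 and 7 hours']
print(matching_options(0.5))  # ['1 hour or less', '5 hours or less']
```

In a well-formed set of options, every plausible value would match exactly one choice.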

Here is a style of question that I sometimes use that might look overlapping at first glance, but is not:

What is the best course of action in context C?

  1. Do X because of RY (the reason you should do Y).
  2. Do X because of RX (the reason you should do X, but a reason that the examinee is expected to know is impossible in context C)
  3. Do Y because of RY (the correct answer)
  4. Do Y because of RX

Two options tell you to do Y (the right thing to do), but for different reasons. One reason is appropriate, the other is not. The test is checking not just whether the examinee can decide what to do but whether she can correctly identify why to do it. This can be a hard question but if you expect a student to know why to do something, requiring them to pick the right reason as well as the right result is entirely fair.

23. Keep the options homogeneous in content and grammatical structure.

Inexperienced exam writers often accidentally introduce variation between the correct answer and the others. For example, the correct answer:

  • might be properly punctuated
  • might start with a capital letter (or not start with one) unlike the others
  • might end with a period or semi-colon (unlike the others)
  • might be present tense (the others in past tense)
  • might be active voice (the others in passive voice), etc.

The most common reason for this is that some exam authors write a long list of stems and correct answers, then fill the rest of the questions in later.

The nasty, sneaky tricky exam writer knows that test-wise students look for this type of variation and so introduces it deliberately:

Which is the right answer?

  1. this is the right answer
  2. This is the better-formatted second-best answer.
  3. this is a wrong answer
  4. this is another wrong answer

The test-savvy guesser will be drawn to answer 2 (bwaa-haaa-haa!)

Tricks are one way to keep down the scores of skilled guessers, but when students realize that you’re hitting them with trick questions, you can lose your credibility with them.

24. Keep the length of options about the same.

Which is the right answer?

  1. this is the wrong answer
  2. This is a really well-qualified and precisely-stated answer that is obviously more carefully considered than the others, so which one do you think is likely to be the right answer?
  3. this is a wrong answer
  4. this is another wrong answer
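Formatting and length cues like these are mechanical enough to check automatically before an exam goes out. The following Python sketch is one possible check, not an established tool. The function name, the set of end marks and the "twice the median length" threshold are all arbitrary assumptions.

```python
END_MARKS = ".;?!"

def format_cues(options):
    """Flag options that stand out by length, capitalization or end punctuation.

    Assumes every option is a non-empty string.
    """
    lengths = sorted(len(text) for text in options)
    median = lengths[len(lengths) // 2]
    capitalized = [text[0].isupper() for text in options]
    punctuated = [text[-1] in END_MARKS for text in options]

    flagged = {}
    for i, text in enumerate(options):
        reasons = []
        if len(text) > 2 * median:  # arbitrary threshold, an assumption
            reasons.append("much longer than the others")
        if capitalized.count(capitalized[i]) == 1:
            reasons.append("capitalization differs")
        if punctuated.count(punctuated[i]) == 1:
            reasons.append("end punctuation differs")
        if reasons:
            flagged[i + 1] = reasons  # report 1-based option numbers
    return flagged

options = [
    "this is the wrong answer",
    "This is a really well-qualified and precisely-stated answer that is "
    "obviously more carefully considered than the others.",
    "this is a wrong answer",
    "this is another wrong answer",
]
print(format_cues(options))
```

Run on the example, the check flags only option 2, for all three reasons.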

25. “None of the above” should be used carefully.

As Haladyna points out, there is a fair bit of controversy over this heuristic:

  • If you use it, make sure that you make it the correct answer sometimes and the incorrect answer sometimes
  • Use it when you are trying to make the student actually solve a problem and assess the reasonability of the possible solutions

26. Avoid using “all of the above.”

The main argument against “all of the above” is that if there is an obviously incorrect option, then “all of the above” is obviously incorrect too. Thus, test-wise examinees can reduce the number of plausible options easily. If you are trying to statistically model the difficulty of the exam, or create correction factors (a “correction” is a penalty for guessing the wrong answer), then including an option that is obviously easier than the others makes the modeling messier.

In our context, we aren’t “correcting” for guessing or estimating the difficulty of the exam:

  • In the BBST (open book) exam, the goal is to get the student to read the material carefully and think about it. Difficulty of the question is more a function of difficulty of the source material than of the question.
  • In the Open Certification exam, every question appears on a public server, along with a justification of the intended-correct answer and public commentary. Any examinee can review these questions and discussions. Some will, some won’t, some will remember what they read and some won’t, some will understand what they read and some won’t. How do you model the difficulty of questions this way? Whatever the models might be, the fact that the “all of the above” option is relatively easy for some students who have to guess is probably a minor factor.

Another argument is more general. Several authors, including Haladyna, Downing, & Rodriguez (2002), recommend against the complex question that allows more than one correct answer. This makes the question more difficult and more confusing for some students.

Even though some authors recommend against it, our question construction adopts a complex structure that allows selection of combinations (such as (a) and (b) as well as all of the above) because other educational researchers consider this structure a useful vehicle for presenting difficult questions in a fair way. See, for example, Wongwiwatthananukit, Popovich & Bennett (2000) and their references.

Note that in the BBST / Open Certification structure, the fact that there is a combination choice or an all of the above choice is not informative because most questions have these.

There is a particular difficulty with this structure, however. Consider this question:

Choose the answer:

  1. This is the best choice
  2. This is a bad choice
  3. This is a reasonable answer, but (a) is far better–or this is really a subset of (a), weak on its own but it would be the only correct one if (a) was not present.
  4. (a) and (b)
  5. (a) and (c)
  6. (b) and (c)
  7. (a) and (b) and (c)

In this case, the student will have an unfairly hard time choosing between (a) and (e). We have created questions like this accidentally, but when we recognize this problem, we fix it in one of these ways:

Alternative 1. Choose the answer:

    1. This is the best choice
    2. This is a bad choice
    3. This is a reasonable answer, but (a) is far better–or this is really a subset of (a), weak on its own but it would be the only correct one if (a) was not present.
    4. This is a bad choice
    5. (a) and (b)
    6. (b) and (c)
    7. (a) and (b) and (c)

    In this case, we make sure that (a) and (c) is not available for selection.

Alternative 2. Choose the answer:

    1. This is the best choice
    2. This is a bad choice
    3. This is a reasonable answer, but (a) is far better–or this is really a subset of (a), weak on its own but it would be the only correct one if (a) was not present.
    4. This is a bad choice

    In this case, no combinations are available for selection.

27. Avoid negative words such as not or except.

This is the same advice, for the options, as we provided in Heuristic #18 for the stem, for the same reasons.

28. Avoid options that give clues to the right answer.

Some of the mistakes mentioned by Haladyna, Downing, & Rodriguez (2002) are:

  • Broad assertions that are probably incorrect, such as always, never, must, and absolutely.
  • Choices that sound like words in the stem, or words that sound like the correct answer
  • Grammatical inconsistencies, length inconsistencies, formatting inconsistencies, extra qualifiers or other obvious inconsistencies that point to the correct choice
  • Pairs or triplet options that point to the correct choice. For example, if every combination option includes (a) (such as “(a) and (b)”, “(a) and (c)” and “all of the above”) then it is pretty obvious that (a) is probably correct and any answer that excludes (a) (such as (b)) is probably wrong.
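The last clue in the list above is easy to detect by counting. This is a minimal Python sketch over a hypothetical item, in which "all of the above" is represented as the combination of all three lettered options.

```python
from collections import Counter

# Hypothetical item: the lettered options each combination choice contains.
combination_options = [("a", "b"), ("a", "c"), ("a", "b", "c")]

counts = Counter(letter for combo in combination_options for letter in combo)

# A letter that appears in every combination is a strong clue.
in_every_combo = [
    letter for letter, n in counts.items() if n == len(combination_options)
]
print(in_every_combo)  # ['a']
```

If the result is not empty, the combinations themselves are giving the answer away.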

29. Make all distractors plausible.

This is important for two reasons:

  • If you are trying to do statistical modeling of the difficulty of the exam (“There are 4 choices in this question, therefore there is only a 25% chance of a correct answer from guessing”) then implausible distractors invalidate the model because few people will make this guess. However, in our tests, we aren’t doing this modeling so this doesn’t matter.
  • An implausible choice is a waste of space and time. If no one will make this choice, it is not really a choice. It is just extra text to read.

One reason that an implausible distractor is sometimes valuable is that sometimes students do pick obviously unreasonable distractors. In my experience, this happens when the student is:

  • ill, and not able to concentrate
  • falling asleep, and not able to concentrate
  • on drugs or drunk, and not able to concentrate or temporarily afflicted with a very strange sense of humor
  • copying answers (in a typical classroom test, looking at someone else’s exam a few feet away) and making a copying mistake.

I rarely design test questions with the intent of including a blatantly implausible option, but I am an inept enough test-writer that a few slip by anyway. These aren’t very interesting in the BBST course, but I have found them very useful in traditional quizzes in the traditionally-taught university course.

30. Use typical errors of students when you write distractors.

Suppose that you gave a fill-in-the-blank question to students. In this case, for example, you might ask the student to tell you the definition rather than giving students a list of definitions to choose from. If you gathered a large enough sample of fill-in-the-blank answers, you would know what the most common mistakes are. Then, when you create the multiple choice question, you can include these as distractors. The students who don’t know the right answer are likely to fall into one of the frequently-used wrong answers.

I rarely have the opportunity to build questions this way, but the principle carries over. When I write a question, I ask “If someone was going to make a mistake, what mistake would they make?”
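When fill-in-the-blank answers are available, tallying them is straightforward. The sketch below is in Python. The function name is hypothetical and the responses are invented for illustration, not collected data.

```python
from collections import Counter

def pick_distractors(responses, correct, how_many=3):
    """Return the most frequent wrong answers, most common first."""
    cleaned = (r.strip().lower() for r in responses)
    wrong = Counter(r for r in cleaned if r != correct)
    return [answer for answer, _ in wrong.most_common(how_many)]

# Invented responses to "What is the largest code in lower ASCII?"
responses = ["127", "128", "255", "128", "127", "126", "255", "128"]
print(pick_distractors(responses, correct="127"))  # ['128', '255', '126']
```

Each of the resulting distractors reflects a specific misunderstanding (an off-by-one error, the 8-bit range, the last printable character), which is what makes it plausible.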

31. Use humor if it is compatible with the teacher; avoid humor in a high-stakes test.

Robert F. McMorris, Roger A. Boothroyd, & Debra J. Pietrangelo (1997) and Powers (2005) advocate for carefully controlled use of humor in tests and quizzes. I think this is reasonable in face-to-face instruction, once the students have come to know the instructor (or in a low-stakes test while students are getting to know the instructor). However, in a test that involves students from several cultures, who have varying degrees of experience with the English language, I think humor in a quiz can create more confusion and irritation than it is worth.


These notes summarize lessons that came out of the last Workshop on Open Certification (WOC 2007) and from private discussions related to BBST.

There’s a lot of excellent advice on writing multiple-choice test questions. Here are a few sources that I’ve found particularly helpful:

  1. Lorin Anderson, David Krathwohl, & Benjamin Bloom, Taxonomy for Learning, Teaching, and Assessing, A: A Revision of Bloom’s Taxonomy of Educational Objectives, Complete Edition, Longman Publishing, 2000.
  2. National Conference of Bar Examiners, Multistate Bar Examination Study Aids and Information Guides.
  3. Steven J. Burton, Richard R. Sudweeks, Paul F. Merrill, Bud Wood, How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty, Brigham Young University Testing Services, 1991.
  4. Thomas M. Haladyna, Writing Test Items to Evaluate Higher Order Thinking, Allyn & Bacon, 1997.
  5. Thomas M. Haladyna, Developing and Validating Multiple-Choice Test Items, 3rd Edition, Lawrence Erlbaum, 2004.
  6. Thomas M. Haladyna, Steven M. Downing, Michael C. Rodriguez, A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment, Applied Measurement in Education, 15(3), 309–334, 2002.
  7. Robert F. McMorris, Roger A. Boothroyd, & Debra J. Pietrangelo, Humor in Educational Testing: A Review and Discussion, Applied Measurement in Education, 10(3), 269-297, 1997.
  8. Ted Powers, Engaging Students with Humor, Association for Psychological Science Observer, 18(12), December 2005.
  9. The Royal College of Physicians and Surgeons of Canada, Developing Multiple Choice Questions for the RCPSC Certification Examinations.
  10. Supakit Wongwiwatthananukit, Nicholas G. Popovich, & Deborah E. Bennett, Assessing pharmacy student knowledge on multiple-choice examinations using partial-credit scoring of combined-response multiple-choice items, American Journal of Pharmaceutical Education, Spring, 2000.
  11. Bibliography and links on Multiple Choice Questions at

References to my blogs

7th Workshop on Teaching Software Testing, January 18-20, 2008

Saturday, October 13th, 2007

This year’s Workshop on Teaching Software Testing (WTST) will be January 18-20 in Melbourne, Florida.

WTST is concerned with the practical aspects of teaching university-caliber software testing courses to academic or commercial students.

This year, we are particularly interested in teaching testing online. How can we help students develop testing skills and foster higher-order thinking in online courses?

We invite participation by:

  • academics who have experience teaching testing courses
  • practitioners who teach professional seminars on software testing
  • academic or practitioner instructors with significant online teaching experience and wisdom
  • one or two graduate students
  • a few seasoned teachers or testers who are beginning to build their strengths in teaching software testing.

There is no fee to attend this meeting. You pay for your seat through the value of your participation. Participation in the workshop is by invitation based on a proposal. We expect to accept 15 participants with an absolute upper bound of 25.

WTST is a workshop, not a typical conference. Our presentations serve to drive discussion. The target readers of workshop papers are the other participants, not archival readers. We are glad to start from already-published papers, if they are presented by the author and they would serve as a strong focus for valuable discussion.

In a typical presentation, the presenter speaks 10 to 90 minutes, followed by discussion. There is no fixed time for discussion. Past sessions’ discussions have run from 1 minute to 3 hours. During the discussion, a participant might ask the presenter simple or detailed questions, describe consistent or contrary experiences or data, present a different approach to the same problem, or (respectfully and collegially) argue with the presenter. In 20 hours of formal sessions, we expect to cover six to eight presentations.

We also have lightning presentations, time-limited to 5 minutes (plus discussion). These are fun and they often stimulate extended discussions over lunch and at night.

Presenters must provide materials that they share with the workshop under a Creative Commons license, allowing reuse by other teachers. Such materials will be posted at


There are few courses in software testing, but a large percentage of software engineering practitioners do test-related work as their main focus. Many of the available courses, academic and commercial, attempt to cover so much material that they are superficial and therefore ineffective for improving students’ skills or ability to analyze and address problems of real-life complexity. Online courses might, potentially, be a vehicle for providing excellent educational opportunities to a diverse pool of students.

Here are examples of ideas that might help us learn more about providing testing education online in ways that realize this potential:

  • Instructive examples: Have you tried teaching testing online? Can you show us some of what you did? What worked? What didn’t? Why? What can we learn from your experience?
  • Instructive examples from other domains: Have you tried teaching something else online and learned lessons that would be applicable to teaching testing? Can you build a bridge from your experience to testing?
  • Instructional techniques, for online instruction, that help students develop skill, insight, appreciation of models and modeling, or other higher-level knowledge of the field. Can you help us see how these apply to testing-related instruction?
  • Test-related topics that seem particularly well-suited to online instruction: Do you have a reasoned, detailed conjecture about how to bring a topic online effectively? Would a workshop discussion help you develop your ideas further? Would it help the other participants understand what can work online and how to make it happen?
  • Lessons learned teaching software testing: Do you have experiences from traditional teaching that seem general enough to apply well to the online environment?
  • Moving from Face-to-Face to Online Instruction – How does one turn a face-to-face class into an effective online class? What works? What needs to change?
  • Digital Backpack – Students and instructors bring a variety of tools and technologies to today’s fully online or web-enhanced classroom. Which tools do today’s teachers need? How can those tools be used? What about students?
  • The Scholarship of Teaching and Learning – How does one research one’s own teaching? What methods capture improved teaching and learning or reveal areas needing improvement? How is this work publishable to meet promotion and tenure requirements?
  • Qualitative Methods – From sloppy anecdotal reports to rigorous qualitative design. How can we use qualitative methods to conduct research on the teaching of computing, including software testing?


Please send a proposal BY DECEMBER 1, 2007 to Cem Kaner that identifies who you are, what your background is, what you would like to present, how long the presentation will take, any special equipment needs, and what written materials you will provide. Along with traditional presentations, we will gladly consider proposed activities and interactive demonstrations.

We will begin reviewing proposals on November 1. We encourage early submissions. It is unlikely but possible that we will have accepted a full set of presentation proposals by December 1.

Proposals should be between two and four pages long, in PDF format. We will post accepted proposals to

We review proposals in terms of their contribution to knowledge of HOW TO TEACH software testing. Proposals that present a purely theoretical advance in software testing, with weak ties to teaching and application, will not be accepted. Presentations that reiterate materials you have presented elsewhere might be welcome, but it is imperative that you identify the publication history of such work.

By submitting your proposal, you agree that, if we accept your proposal, you will submit a scholarly paper for discussion at the workshop by January 7, 2008. Workshop papers may be of any length and follow any standard scholarly style. We will post these at as they are received, for workshop participants to review before the workshop.


Please send a message BY DECEMBER 1, 2007, to Cem Kaner that describes your background and interest in teaching software testing. What skills or knowledge do you bring to the meeting that would be of interest to the other participants?


Florida Tech’s Center for Software Testing Education & Research has been developing a collection of hybrid and online course materials for teaching black box software testing. We now have NSF funding to adapt these materials for implementation by a broader audience. We are forming an Advisory Board to guide this adaptation and the associated research on the effectiveness of the materials in diverse contexts. The Board will meet before WTST, on January 17, 2008. If you are interested in joining the Board and attending the January meeting, please read this invitation and submit an application.


Support for this meeting comes from the Association for Software Testing and Florida Institute of Technology.

The hosts of the meeting are:

Research Funding and Advisory Board for the Black Box Software Testing (BBST) Course

Friday, October 12th, 2007

Summary: With some new NSF funding, we are researching and revising BBST to make it more available and more useful to more people around the world. The course materials will continue to be available for free. If you are interested in joining an advisory board that helps us set direction for the course and the research surrounding the course, please contact me, describing your background in software-testing-related education, in education-related research, and your reason(s) for wanting to join the Board.

Starting as a joint project with Hung Quoc Nguyen in 1993, I’ve done a lot of development of a broad set of course materials for black box software testing. The National Science Foundation approved a project (EIA-0113539 ITR/SY+PE “Improving the Education of Software Testers”) that evolved my commercial-audience course materials for an academic audience and researched learning issues associated with testing. The resulting course materials are at, with lots of papers at and The course materials are available for everyone’s use, for free, under a Creative Commons license.

During that research, I teamed up with Rebecca Fiedler, an experienced teacher (now an Assistant Professor of Education at St. Mary-of-the-Woods College in Terre Haute, Indiana, and also now my wife). The course that Rebecca and I evolved turned traditional course design inside out in order to encourage students’ involvement, skill development and critical thinking. Rather than using class time for lectures and students’ private time for activities (labs, assignments, debates, etc.), we videotaped the lectures and required students to watch them before coming to class. We used class time for coached activities centered more on the students than the professor.

This looked like a pretty good teaching approach, our students liked it, and the National Science Foundation funded a project to extend this approach to developing course materials on software engineering ethics in 2006. (If you would like to collaborate with us on this project, or if you are a law student interested in a paid research internship, contact Cem Kaner.)

Recently, the National Science Foundation approved Dr. Fiedler’s and my project to improve the BBST course itself, “Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing.” With funding running from October 1, 2007 through 2010, our primary goals are:

  • develop and sustain a cadre of academic, in-house, and commercial instructors via:
    • creating and offering an instructor orientation course online;
    • establishing an ongoing online instructors’ forum; and
    • hosting a number of face-to-face instructor meetings
  • offer and evaluate the course at collaborating research sites (including both universities and businesses)
  • analyze several collections of in-class activities to abstract a set of themes / patterns that can help instructors quickly create new activities as needed; and
  • extend instructional support material including grading guides and a pool of exam questions for teaching the course.

All of our materials—such as videos, slides, exams, grading guides, and instructor manuals—are Creative Commons licensed. Most are available freely to the public. A few items designed to help instructors grade student work will be available at no charge, but only to instructors.

Several individuals and organizations have agreed to collaborate in this work, including:

  • AppLabs Technologies. Representative: Geetha Narayanan, CSQA, PMP; Shyam Sunder Depuru.
  • Aztechsoft. Representative: Ajay Bhagwat.
  • The Association for Software Testing. Representative: Michael Kelly, President.
    • AST is breaking the course into several focused, online, mini-courses that run 1 month each. The courses are offered, for free, to AST members. AST is starting its second teaching of the Foundations course this week. We’ll teach Bug Advocacy in a month. As we develop these courses, we are training instructors who, after sufficient training, will teach the course(s) they are trained to teach for AST (free courses) as well as at their school or company (for free or fee, as they choose).
  • Dalhousie University. Representative: Professor Morven Gentleman.
  • Huston-Tillotson University, Computer Science Department. Representative: Allen M. Johnson, Jr., Ph.D.
  • Microsoft. Representative: Marianne Guntow.
  • PerfTest Plus. Representative: Scott Barber.
  • Quardev Laboratories. Representative: Jonathan Bach.
  • University of Illinois at Springfield, Computer Sciences Program. Representative: Dr. Keith W. Miller.
  • University of Latvia. Representative: Professor Juris Borzovs.

If you would like to collaborate on this project as well:

  1. Please read our research proposal.
  2. Please consider your ability to make a financial commitment. We are not asking for donations (well, of course, we would love to get donations, but they are not required) but you or your company would have to absorb the cost of travel to Board of Advisor meetings and you would probably come to the Workshop on Teaching Software Testing and/or the Conference of the Association for Software Testing. Additionally, teaching the course at your organization and collecting the relevant data would be at your expense. (My consultation to you on this teaching would be free, but if you needed me to fly to your site, that would be at your expense and might involve a fee.) We have a little bit of NSF money to subsidize travel to Board of Advisor meetings ($15,000 total for the three years) so we can subsidize travel to a small degree. But it is very limited, and especially little is available for corporations.
  3. Please consider your involvement. What do you want to do?
    • Join the advisory board, help guide the project?
    • Collaborate on the project as a fellow instructor (and get instructor training)?
    • Come to the Workshop on Teaching Software Testing?
    • Help develop a Body of Knowledge to support the course materials?
    • Participate as a lecturer or on-camera discussant on the video courses?
    • Other stuff, such as …???
  4. Send me a note that covers 1-3, introduces you, and describes your background and interest.

The first meeting of the Advisory Board is January 17, 2008, in Melbourne, Florida. We will host the Workshop on Teaching Software Testing (WTST 2008) from January 18-20. I’ll post a Call for Participation for WTST 2008 on this blog tomorrow.

Open Certification at ACM SIGCSE

Sunday, April 1st, 2007

Tim Coulter and I presented a poster session at the ACM SIGCSE conference.

Here is our poster (page 1) and (page 2)

Here is our paper: Creating an Open Certification Process

I was glad to see the Agile Alliance’s position on certification.

It’s unfortunate that several of the groups who present themselves as professional associations for software testers or software quality workers (e.g. American Society for Quality, British Computer Society, International Institute for Software Testing, Quality Assurance Institute) are selling certifications rather than warning people off of them as Agile Alliance is doing.

We created the Open Certification as an alternative approach. It still doesn’t measure skill. But it does offer four advantages:

  1. The large pool of questions will be public, with references. They will form a study guide.
  2. The pool is not derived from a single (antiquated) view of software testing. Different people with different viewpoints can add their own questions to the pool. If they have well-documented questions/answers, the questions will be accepted and can be included in a customizable exam.
  3. The exam can be run any time, anywhere. Instead of relying on a certificate, an employer can ask a candidate to retake the test and then discuss the candidate’s answers with her. The discussion will be more informative than any number of multiple-false answers.
  4. The exam is free.

The open certification is a bridge between the current certifications and the skill-based evaluations that interviewers should create for themselves or that might someday be available as exams (but are not available today).

The open certification exam isn’t available yet. We’re close to finishing the Question Server (see the paper), which is one of the key components. We’ll start work on the next piece at the Second Workshop on Open Certification, which will be right after CAST 2007. (More on both soon…)

In the meantime, my recommendation is simple:

  • If you are a job candidate considering becoming certified, save your money. Take courses that teach you about the practice of testing rather than about how to pass an exam. (If you want a free course, go to
  • If you are an employer considering what qualifications testing candidates should have, my article on recruiting might help. If you can’t live with that and must use an examination, the following probably has as much validity as the certifications–and it has a high failure rate, so it must demonstrate high standards, right?
Test THIS Certification Exam
Professor Cem Kaner
is a
Purple People-Eating Magic Monkey
…with a Battle Rating of 9.1

To see if your Food-Eating Battle Monkey can
defeat Professor Cem Kaner, enter your name:

Schools of software testing

Friday, December 22nd, 2006

Every few months, someone asks why James Bach, Bret Pettichord, and I discuss the software testing community in terms of “schools” or suggests that the idea is misguided, arrogant, divisive, inaccurate, or otherwise A Bad Thing. The most recent discussion is happening on the software-testing list (the context-driven testing school’s list on It’s time that I blogged a clarifying response.
Perhaps in 1993, I started noticing that many test organizations (and many test-related authors, speakers and consultants, including some friends and other colleagues I respected) relied heavily — almost exclusively — on one or two main testing techniques. In my discussions with them, they typically seemed unaware of other techniques or uninterested in them.

  • For example, many testers talked about domain testing (boundary and equivalence class analysis) as the fundamental test design technique. You can generalize the method far beyond its original scope (analysis of input fields) to a strategy for reducing the infinity of potential tests to a manageably small, powerful subset via a stratified sampling strategy (see the printer testing chapter of Testing Computer Software for an example of this type of analysis). This is a compelling approach–it yields well-considered tests and enough of them to keep you busy from the start to the end of the project.
  • As a competing example, many people talked about scenario testing as the fundamental approach. They saw domain tests as mechanical and relatively trivial. Scenarios went to the meaning of the specification (where there was one) and to the business value and business risk of the (commercial) application. You needed subject matter experts to do really good scenario testing. To some of my colleagues, this greater need for deep understanding of the application was proof in its own right that scenario testing was far more important than the more mechanistic-seeming domain testing.
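
For readers who have not met domain testing, here is a minimal sketch of its simplest form: boundary analysis on a single integer input field. The function name `boundary_values` and the 1 to 99 range are hypothetical, and the sketch covers only the original input-field scope, not the generalized sampling strategy described above.

```python
def boundary_values(low, high):
    """Boundary tests for an integer field that accepts low..high inclusive."""
    if low > high:
        raise ValueError("low must not exceed high")
    # Edges of the valid equivalence class (a set handles the low == high case).
    valid = sorted({low, high})
    # Nearest members of the two invalid classes, just outside each edge.
    invalid = [low - 1, high + 1]
    return {"valid": valid, "invalid": invalid}


# Hypothetical field that accepts quantities from 1 to 99.
print(boundary_values(1, 99))
```

Four values stand in for every integer the field could receive, which is the sense in which the technique reduces a huge space of potential tests to a small subset.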

In 1995, James Bach and I met face-to-face for the first time. (We had corresponded by email for a long time, but never met.) We ended up spending half a day in a coffee shop at the Dallas airport comparing notes on testing strategy. He too had been listing these dominating techniques and puzzling over the extent to which they seemed to individually dominate the test-design thinking of many colleagues. James was putting together a list that he would soon publish in the first draft of the Satisfice Heuristic Test Strategy Model. I was beginning to use the list for consciousness-raising in my classes and consulting, encouraging people to add one or two more techniques to their repertoire.
James and I were pleasantly shocked to discover that our lists were essentially the same. Our names for the techniques were different, but the list covered the same approaches and we had both seen the same tunnel-vision (one or more companies or groups that did reasonably good work–as measured by the quality of bugs they were finding–that relied primarily on just this one technique, for each of the techniques). I think it was in that discussion that I suggested a comparison to Thomas Kuhn’s notion of paradigms (for a summary, read this article in the Stanford Encyclopedia of Philosophy), which we studied at some length in my graduate school (McMaster University, Psychology Ph.D. program).
The essence of a paradigm is that it creates (or defines) a mainstream of thinking, providing insights and direction for future research or work. It provides a structure for deciding what is interesting, what is relevant, what is important–and implicitly, it defines limits, what is not relevant, not particularly interesting, maybe not possible or not wise. The paradigm creates a structure for solving puzzles and people who solve the puzzles seen as important in the field are highly respected. Scientific paradigms often incorporate paradigmatic cases–exemplars–especially valuable examples that serve as models for future work or molds for future thought. At that meeting in 1995, or soon after that, we concluded that the set of dominant test techniques we were looking at could be thought of as paradigms for some / many people in the field.
This idea wouldn’t make sense in a mature science, because there is one dominating paradigm that creates a common (field-wide) vocabulary, a common sense of the history of the field, and a common set of cherished exemplars. However, in less mature disciplines that have not reached consensus, fragmentation is common. Here’s how the Stanford Encyclopedia of Philosophy summarizes it:

“In the postscript to the second edition of The Structure of Scientific Revolutions Kuhn says of paradigms in this sense that they are “the most novel and least understood aspect of this book” (1962/1970a, 187). The claim that the consensus of a disciplinary matrix is primarily agreement on paradigms-as-exemplars is intended to explain the nature of normal science and the process of crisis, revolution, and renewal of normal science. It also explains the birth of a mature science. Kuhn describes an immature science, in what he sometimes calls its ‘pre-paradigm’ period, as lacking consensus. Competing schools of thought possess differing procedures, theories, even metaphysical presuppositions. Consequently there is little opportunity for collective progress. Even localized progress by a particular school is made difficult, since much intellectual energy is put into arguing over the fundamentals with other schools instead of developing a research tradition. However, progress is not impossible, and one school may make a breakthrough whereby the shared problems of the competing schools are solved in a particularly impressive fashion. This success draws away adherents from the other schools, and a widespread consensus is formed around the new puzzle-solutions.”

James and I weren’t yet drawn to the idea of well-defined schools, because we see schools as having not only a shared perspective but also a missionary character (more on that later) and what we were focused on was the shared perspective. But the idea of competing schools allowed us conceptual leeway for thinking about, and describing, what we were seeing. We talked about it in classes and small meetings and were eventually invited to present it (Paradigms of Black Box Software Testing) as a pair of keynotes at the 16th International Conference and Exposition on Testing Computer Software (1999) and then at STAR West (1999).
At this point (1999), we listed 10 key approaches to testing:

  • Domain driven
  • Stress driven
  • Specification driven
  • Risk driven
  • Random / statistical
  • Function
  • Regression
  • Scenario / use case / transaction flow
  • User testing
  • Exploratory

There’s a lot of awkwardness in this list, but our intent was descriptive and this is what we were seeing. When people would describe how they tested to us, their descriptions often focused on only one (or two, occasionally three) of these ten techniques, and they treated the described technique(s), or key examples of it, as guides for how to design tests in the future. We were trying to capture that.
At this point, we weren’t explicitly advocating a school of our own. We had a shared perspective, we were talking about teaching people to pick their techniques based on their project’s context and James’ Heuristic Test Strategy Model (go here for the current version) made this explicit, but we were still early in our thinking about it. In contrast, the list of dominating techniques, with minor evolution, captured a significant pattern in what each of us had observed over perhaps 7-10 years.
I don’t think that many people saw this work as divisive or offensive — some did and we got some very harsh feedback from a few people. Others were intrigued or politely bored.
Several people were confused by it, not least because the techniques on this list were far from mutually exclusive. For example, you can apply a domain-driven analysis of the variables of individual functions in an exploratory way. Is this domain testing, function testing or exploratory testing?

  • One answer is “yes” — all three.
  • Another answer is that it is whichever technique is dominant in the mind of the tester who is doing this testing.
  • Another answer is, “Gosh, that is confusing, isn’t it? Maybe this model of “paradigms” isn’t the right subdivider of the diverging lines of testing thought.”

Over time, our thinking shifted about which answer was correct. Each of these would have been my answer — at different times. (My current answer is the third one.)
Another factor that was puzzling us was the weakness of communication among leaders in the field. At conferences, we would speak the same words but with different meanings. Even the most fundamental terms, like “test case” carried several different meanings–and we weren’t acknowledging the differences or talking about them. Many speakers would simply assume that everyone knew what term X meant, agreed with that definition, and agreed with whatever practices were impliedly good that went with those definitions. The result was that we often talked past each other at conferences, disagreeing in ways that many people in the field, especially relative newcomers, found hard to recognize or understand.
It’s easy to say that all testing involves analysis of the program, evaluation of the best types of tests to run, design of the tests, execution, skilled troubleshooting, and effective communication of the results. Analysis. Evaluation. Design. Test. Execution. Troubleshooting. Effective Communication. We all know what those words mean, right? We all know what good analysis is, right? So, basically, we all agree, right?
Well, maybe not. We can use the same words but come up with different analyses, different evaluations, different tests, different ideas about how to communicate effectively, and so on.
Should we gloss over the differences, or look for patterns in them?
James Bach, Bret Pettichord and I muddled through this as we wrote Lessons Learned in Software Testing. We brought a variety of other people into the discussions but as I vaguely recall it, the critical discussions happened in the context of the book. Bret Pettichord put the idea into a first-draft presentation for the 2003 Workshop on Teaching Software Testing. He has polished it since, but it is still very much a work in progress.
I’m still not ready to publish my version because I haven’t finished the literature review that I’d want to publish with it. We were glad to see Bret go forward with his talks, because they opened the door for peer review that provides the foundation for more polished later papers.
The idea that there can be different schools of thought in a field is hardly a new one — just check the 1,180,000 search results you get from Google or the 1,230,000 results you get from Yahoo when you search for the quoted phrase “schools of thought”.
Not everyone finds the notion of divergent schools of thought a useful heuristic–for example, read this discussion of legal schools of thought. However, in studying, teaching and researching experimental psychology, the identification of some schools was extremely useful. Not everyone belonged to one of the key competing schools. Not every piece of research was driven by a competing-school philosophy. But there were organizing clusters of ideas with charismatic advocates that guided the thinking and work of several people and generated useful results. There were debates between leaders of different schools, sometimes very sharp debates, and those debates clarified differences, points of agreement, and points of open exploration.
As I think of schools of thought, a school of testing would have several desirable characteristics:

  • The members share several fundamental beliefs, broadly agree on vocabulary, and will approach similar problems in compatible ways
    • members typically cite the same books or papers (or books and papers that say the same things as the ones most people cite)
    • members often refer to the same stories / myths and the same justifications for their practices
  • Even though there is variation from individual to individual, the thinking of the school is fairly comprehensive. It guides thinking about most areas of the job, such as:
    • how to analyze a product
    • what test techniques would be useful
    • how to decide that X is an example of good work or an example of weak work (or not an example of either)
    • how to interpret test results
    • how much troubleshooting of apparent failures, why, and how much troubleshooting by the testers
    • how to staff a test team
    • how to train testers and what they should be trained in
    • what skills (and what level of skill diversity) are valuable on the team
    • how to budget, how to reach agreements with others (management, programmers) on scope of testing, goals of testing, budget, release criteria, metrics, etc.
  • To my way of thinking, a school also guides thought in terms of how you should interact with peers
    • what kinds of argument are polite and appropriate in criticizing others’ work
    • what kinds of evidence are persuasive
    • they provide forums for discussion among school members, helping individuals refine their understanding and figure out how to solve not-yet-solved puzzles
  • I also see schools as proselytic
    • they think their view is right
    • they think you should think their view is right
    • they promote their view
  • I think that the public face of many schools is the face(s) of the identified leader(s). Bret’s attempt to characterize schools in terms of human representatives was intended as constructive and respectful to the schools involved.

I don’t think the testing community maps perfectly to this. For example (as the biggest example, in my view), very few people are willing to identify themselves as leaders or members of the other (other than context-driven) schools of testing. (I think Agile TDD is another school (the fifth of Bret’s four) and that there are clear thought-leaders there, but I’m not sure that they’ve embraced the idea that they are a school either.) Despite that, I think the notion of division into competing schools is a useful heuristic.

At my time of writing, I think the best breakdown of the schools is:

  • Factory school: emphasis on reduction of testing tasks to routines that can be automated or delegated to cheap labor.
  • Control school: emphasis on standards and processes that enforce or rely heavily on standards.
  • Test-driven school: emphasis on code-focused testing by programmers.
  • Analytical school: emphasis on analytical methods for assessing the quality of the software, including improvement of testability by improved precision of specifications and many types of modeling.
  • Context-driven school: emphasis on adapting to the circumstances under which the product is developed and used.

I think this division helps me interpret some of what I read in articles and what I hear at conferences. I think it helps me explain–or at least rationally characterize–differences to people who I’m coaching or training, who are just becoming conscious of the professional-level discussions in the field.
Acknowledging the imperfect mapping, it’s still interesting to ask, as I read something from someone in the field, whether it fits in any of the groups I think of as a school and if so, whether it gives me insight into any of that school’s answers to the issues in the list above–and if so, whether that insight tells me more that I should think about for my own approach (along with giving me better ways to talk about or talk to people who follow the other approach).
When Bret first started giving talks about 4 schools of software testing, several people reacted negatively:

  • they felt it was divisive
  • they felt that it created a bad public impression because it would be better for business (for all consultants) if it looks as though we all agree on the basics and therefore we are all experts whose knowledge and experience can be trusted
  • they felt that it was inappropriate because competent testers all pretty much agree on the fundamentals.

One of our colleagues (a friend) chastised us for this competing-schools analysis. I think he thought that we were saying this for marketing purposes, that we actually agreed with everyone else on the fundamentals and knew that we all agreed on the fundamentals. We assured him that we weren’t kidding. We might be wrong, but we were genuinely convinced that the field faced powerful disagreements even on the very basics. Our colleague decided to respond with a survey that asked a series of basic questions about testing. Some questions checked basic concepts. Others identified a situation and asked about the best course of action. He was able to get a fairly broad set of responses from a diverse group of people, many (most?) of them senior testers. The result was to highlight the broad disagreement in the field. He chose not to publish (I don’t think he was suppressing his results; I think it takes a lot of work to go from an informal survey to something formal enough to publish). Perhaps someone doing an M.Sc. thesis in the field would like to follow this up with a more formally controlled survey of an appropriately stratified sample across approaches to testing, experience levels, industry and perhaps geographic location. But until I see something better, what I saw in the summary of results given to me looked consistent with what I’ve been seeing in the field for 23 years–we have basic, fundamental, foundational disagreements about the nature of testing, how to do it, what it means to test, who our clients are, what our professional responsibilities are, what educational qualifications are appropriate, how to research the product, how to identify failure, how to report failure, what the value of regression testing is, how to assess the value of a test, etc., etc.
So is there anything we can learn from these differences?

  • One of the enormous benefits of competing schools is that they create a dialectic.
    • It is one thing for theoreticians who don’t have much influence in a debate to characterize the arguments and positions of other people. Those characterizations are descriptive, perhaps predictive. But they don’t drive the debate. I think this is the kind of school characterization that was being attacked on Volokh’s page.
    • It’s very different for someone in a long term debate to frame their position as a contrast with others and invite response. If the other side steps up, you get a debate that sharpens the distinctions, brings into greater clarity the points of agreement, and highlights the open issues that neither side is confident in. It also creates a collection of documented disagreements, documented conflicting predictions and therefore a foundation for scientific research that can influence the debate. This is what I saw in psychology (first-hand, watching leaders in the field structure their work around controversy).
    • We know full well that we’re making mistakes in our characterization of the other views. We aren’t intentionally making mistakes, and we correct the ones that we finally realize are mistakes, but nevertheless, we have incorrect assumptions and conclusions within our approach to testing and in our understanding of the other folks’ approaches to testing. Everybody else in the field is making mistakes too. The stronger the debate (I don’t mean the nastier; I mean the more seriously we do it, the better we research and/or consider our responses, etc.), the more of those mistakes we’ll collectively bring to the surface and the more opportunities for common ground we will discover. That won’t necessarily lead to the One True School ever, because some of the differences relate to key human values (for example, I think some of us have deep personal-values differences in our notions of the relative worths of the individual human versus the identified process). But it might lead to the next generation of debate, where a few things are indeed accepted as foundational by everyone and the new disagreements are better informed, with a better foundation of data and insight.
  • The dialectic doesn’t work very well if the other side won’t play.
    • Pretending that there aren’t deep disagreements won’t make them go away
  • Even if the other side won’t engage, the schools-approach creates a set of organizing heuristics. We have a vast literature. There are over 1000 theses and dissertations in the field. There are conferences, magazines, journals, lots of books, and new links (e.g. TDD) with areas of work previously considered separate. It’s not possible to work through that much material without imposing simplifying structures. The four-schools (I prefer five-schools) approach provides one useful structure. (“All models are wrong; some models are useful.” — George Box)

It’s 4:27 a.m. Reading what I’ve written above, I think I’m starting to ramble and I know that I’m running out of steam, so I’ll stop here.

Well, one last note.

It was never our intent to use the “schools” notion to demean other people in the field. What we are trying to do is to capture commonalities (of agreement and disagreement).

  • For example, I think there are a lot of people who really like the idea that some group (requirements analysts, business analysts, project manager, programmers) agrees on an authoritative specification, and that the proper role of Testing is to translate the specification to powerful test cases, automate the tests, run them as regression tests, and report metrics based on these runs.
  • Given that there are a lot of people who share this view, I’d like to be able to characterize the view and engage it, without having to address the minor variations that come up from person to person. Call the collective view whatever you want. The key issues for me are:
    • Many people do hold this view
    • Within the class of people who hold this view, what other views do they hold or disagree with?
    • If I am mischaracterizing the view, it is better for me to lay it out in public and get corrected than push it more privately to my students and clients and not get corrected
  • The fact is, certainly, that I think this view is defective. But I didn’t need to characterize it as a school to think it is often (i.e. in many contexts) a deeply misguided approach to testing. Nor do I need to set it up as a school to have leeway to publicly criticize it.
  • The value is getting the clearest and most authoritatively correct expression of the school that we can, so that if/when we want to attack it or coach it into becoming something else, we have the best starting point and the best peer review process that we can hope to achieve.

Comments are welcome. Maybe we can argue this into a better article.


Assessment Objectives. Part 3: Adapting the Anderson & Krathwohl taxonomy for software testing

Saturday, December 9th, 2006

I like the Anderson / Krathwohl approach (simple summaries here and here and here). One of the key improvements in this update to Bloom’s taxonomy was the inclusion of different types of knowledge (The Knowledge Dimension) as well as different levels of knowledge (The Cognitive Process Dimension).

                         | The Cognitive Process Dimension
The Knowledge Dimension  | Remember | Understand | Apply | Analyze | Evaluate | Create
Factual knowledge        |          |            |       |         |          |
Conceptual knowledge     |          |            |       |         |          |
Procedural knowledge     |          |            |       |         |          |
Metacognitive knowledge  |          |            |       |         |          |

Original Anderson Krathwohl model

This is a useful model, but there are things we learn as software testers that don’t quite fit in the knowledge categories. After several discussions with James Bach, I’ve adopted the following revision as the working model that I use:

                         | The Cognitive Process Dimension
The Knowledge Dimension  | Remember | Understand | Apply | Analyze | Evaluate | Create
Cognitive strategies     |          |            |       |         |          |

Anderson Krathwohl model modified for software testing

Here are my working definitions / descriptions:


  • A “statement of fact” is a statement that can be unambiguously proved true or false. For example, “James Bach was born in 1623” is a statement of fact. (But not true, for the James Bach we know and love.) A fact is the subject of a true statement of fact.
  • Facts include such things as:
    • Tidbits about famous people
    • Famous examples (the example might also be relevant to a concept, procedure, skill or attitude)
    • Items of knowledge about devices (for example, a description of an interoperability problem between two devices)


  • A concept is a general idea. “Concepts are abstract in that they omit the differences of things in their extension, treating them as if they were identical.” (wikipedia: Concept).
  • In practical terms, we treat the following kinds of things as “concepts” in this taxonomy:
    • definitions
    • descriptions of relationships between things
    • descriptions of contrasts between things
    • description of the idea underlying a practice, process, task, heuristic (whatever)
  • Here’s a distinction that you might find useful.
    • Consider the oracle heuristic, “Compare the behavior of this program with a respected competitor and report a bug if this program’s behavior seems inconsistent with and possibly worse than the competitor’s.”
      • If I am merely describing the heuristic, I am giving you a concept.
      • If I tell you to make a decision based on this heuristic, I am giving you a rule.
    • Sometimes, a rule is a concept.
    • A rule is an imperative (“Stop at a red light”) or a causal relationship (“Two plus two yields four”) or a statement of a norm (“Don’t wear undershorts outside of your pants at formal meetings”).
    • The description / definition of the rule is the concept
    • Applying the rule in a straightforward way is application of a concept
    • The decision to puzzle through the value or applicability of a rule is in the realm of cognitive strategies.
    • The description of a rule in a formalized way is probably a model.
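
The competitor-comparison oracle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `our_round` and `reference_round` are hypothetical stand-ins for the program under test and the respected competitor, and `compare_with_reference` is a name invented for this sketch. Because the oracle is a heuristic, each mismatch is a lead to investigate, not proof of a bug.

```python
import math
from typing import Callable, Iterable, List, Tuple


def compare_with_reference(
    program: Callable[[float], float],
    reference: Callable[[float], float],
    inputs: Iterable[float],
    tolerance: float = 1e-9,
) -> List[Tuple[float, float, float]]:
    """Return (input, our_result, reference_result) for every mismatch."""
    suspects = []
    for x in inputs:
        ours, theirs = program(x), reference(x)
        if abs(ours - theirs) > tolerance:
            suspects.append((x, ours, theirs))
    return suspects


# Hypothetical program under test: int() truncates toward zero,
# so this rounds some negative values the wrong way.
def our_round(x: float) -> int:
    return int(x + 0.5)


# Hypothetical "respected competitor": rounds half up for all values.
def reference_round(x: float) -> int:
    return math.floor(x + 0.5)


suspects = compare_with_reference(
    our_round, reference_round, [0.4, 1.5, 2.7, -1.2, -2.5]
)
for x, ours, theirs in suspects:
    print(f"input {x}: we returned {ours}, reference returned {theirs}")
```

Describing this comparison is a concept; deciding whether a reported mismatch actually matters for this product is where the cognitive strategy comes in.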


  • “Procedures” are algorithms. They include a reproducible set of steps for achieving a goal.
  • Consider the task of reporting a bug. Imagine that someone has
    • broken this task down into subtasks (simplify the steps, look for more general conditions, write a short descriptive summary, etc.) and
    • presented the tasks in a sequential order.
  • This description is intended as a procedure if the author expects you to do all of the steps in exactly this order every time.
  • This description is a cognitive strategy if it is meant to provide a set of ideas to help you think through what you have to do for a given bug, with the understanding that you may do different things in different orders each time, but find this a useful reference point as you go.

Cognitive Strategies

  • “Cognitive strategies are guiding procedures that students can use to help them complete less-structured tasks such as those in reading comprehension and writing. The concept of cognitive strategies and the research on cognitive strategies represent the third important advance in instruction.
    “There are some academic tasks that are “well-structured.” These tasks can be broken down into a fixed sequence of subtasks and steps that consistently lead to the same goal. The steps are concrete and visible. There is a specific, predictable algorithm that can be followed, one that enables students to obtain the same result each time they perform the algorithmic operations. These well-structured tasks are taught by teaching each step of the algorithm to students. The results of the research on teacher effects are particularly relevant in helping us learn how teach students algorithms they can use to complete well-structured tasks.
    “In contrast, reading comprehension, writing, and study skills are examples of less-structured tasks: tasks that cannot be broken down into a fixed sequence of subtasks and steps that consistently and unfailingly lead to the goal. Because these tasks are less-structured and difficult, they have also been called higher-level tasks. These types of tasks do not have the fixed sequence that is part of well-structured tasks. One cannot develop algorithms that students can use to complete these tasks.”
    Gleefully pilfered from: Barak Rosenshine, Advances in Research on Instruction, Chapter 10 in J.W. Lloyd, E.J. Kameenui, and D. Chard (Eds.) (1997) Issues in educating students with disabilities. Mahwah, N.J.: Lawrence Erlbaum, pp. 197-221.
  • In cognitive strategies, we include:
    • heuristics (fallible but useful decision rules)
    • guidelines (fallible but common descriptions of how to do things)
    • good (rather than “best”) practices
  • The relationship between cognitive strategies and models:
    • deciding to apply a model and figuring out how to apply a model involve cognitive strategies
    • deciding to create a model and figuring out how to create models to represent or simplify a problem involve cognitive strategies


    • the model itself is a simplified representation of something, done to give you insight into the thing you are modeling.
    • We aren’t sure that the distinction between models and the use of them is worthwhile, but it seems natural to us so we’re making it.


  • A model is
    • A simplified representation created to make something easier to understand, manipulate or predict some aspects of the modeled object or system.
    • Expression of something we don’t understand in terms of something we (think we) understand.
  • A state-machine representation of a program is a model.
  • Deciding to use a state-machine representation of a program as a vehicle for generating tests is a cognitive strategy.
  • Slavishly following someone’s step-by-step catalog of best practices for generating a state-machine model of a program in order to derive scripted test cases for some fool to follow is a procedure.
  • This definition of a model is a concept.
  • The assertion that Harry Robinson publishes papers on software testing and models is a statement of fact.
  • Sometimes, a rule is a model.
    • A rule is an imperative (“Stop at a red light”) or a causal relationship (“Two plus two yields four”) or a statement of a norm (“Don’t wear undershorts outside of your pants at formal meetings”).
    • A description / definition of the rule is probably a concept
    • A symbolic or generalized description of a rule is probably a model.
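
To make the state-machine example concrete, here is a minimal Python sketch. Everything in it is assumed for illustration: `MODEL` is a hypothetical media player, and covering each transition once is just one of many possible ways to derive tests from such a model. The dictionary is the model; choosing to walk it to produce tests is the cognitive strategy.

```python
from collections import deque

# Hypothetical model of a media player: state -> {event: next_state}.
MODEL = {
    "stopped": {"play": "playing"},
    "playing": {"pause": "paused", "stop": "stopped"},
    "paused": {"play": "playing", "stop": "stopped"},
}


def shortest_path(model, start, goal):
    """Breadth-first search for the shortest event sequence from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, events = queue.popleft()
        if state == goal:
            return events
        for event, next_state in model[state].items():
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, events + [event]))
    return None  # goal is unreachable from start


def transition_tests(model, start):
    """Build one event sequence per transition: reach its source state, then fire it."""
    tests = []
    for state, edges in model.items():
        prefix = shortest_path(model, start, state)
        if prefix is None:
            continue  # skip states the model cannot reach
        for event in edges:
            tests.append(prefix + [event])
    return tests


tests = transition_tests(MODEL, "stopped")
for sequence in tests:
    print(" -> ".join(sequence))
```

Each printed sequence is only a test idea. Deciding what to observe after each event, and whether the model is worth trusting at all, is left to the tester.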


  • Skills are things that improve with practice.
    • Effective bug report writing is a skill, and includes several other skills.
    • Taking a visible failure and varying your test conditions until you find a simpler set of conditions that yields the same failure is skilled work. You get better at this type of thing over time.
  • Entries into this section will often be triggered by examples (in instructional materials) that demonstrate skilled work, like “Here’s how I use this technique” or “Here’s how I found that bug.”
  • The “here’s how” might be classed as a:
    • procedure
    • cognitive strategy, or
    • skill
  • In many cases, it would be accurate and useful to class it as both a skill and a cognitive strategy.
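
The mechanical part of simplifying a failure can be sketched in Python. This is a one-at-a-time greedy reduction chosen for illustration, not a description of how any particular tester works; `simplify_failure`, the condition names, and the `still_fails` check are all hypothetical. The skill lies in everything the sketch leaves out, such as deciding which conditions to vary and recognizing when two failures are really the same failure.

```python
def simplify_failure(conditions, still_fails):
    """Greedily drop one condition at a time while the failure still reproduces."""
    current = list(conditions)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current = candidate  # this condition was not needed
                changed = True
                break  # restart the scan on the shorter list
    return current


# Hypothetical failure: the bug needs both landscape and duplex printing.
def still_fails(conditions):
    return "landscape" in conditions and "duplex" in conditions


original = ["A4 paper", "landscape", "color", "duplex", "draft quality"]
minimal = simplify_failure(original, still_fails)
print(minimal)
```

In real work, `still_fails` means rerunning the test, which is slow and sometimes unreliable, so the order in which you try to remove conditions matters far more than it does here.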


  • “An attitude is a persisting state that modifies an individual’s choices of action.” Robert M. Gagne, Leslie J. Briggs & Walter W. Wager (1992) “Principles of Instructional Design” (4th Ed.), p. 48.
  • Attitudes are often based on beliefs (a belief is a proposition that is held as true whether it has been verified true or not). Instructional materials often attempt to influence the student’s attitudes.
  • For example, when we teach students that complete testing is impossible, we might spin the information in different ways to influence student attitudes toward their work:
    • given the impossibility, testers must be creative and must actively consider what they can do at each moment that will yield the highest informational return for their project
    • given the impossibility, testers must conform to the carefully agreed procedures because these reflect agreements reached among the key stakeholders rather than diverting their time to the infinity of interesting alternatives
  • Attitudes are extremely controversial in our field and refusal to acknowledge legitimate differences (or even the existence of differences) has been the source of a great deal of ill will.
  • In general, if we identify an attitude or an attitude-related belief as something to include as an assessable item, we should expect to create questions that:
    • define the item without requiring the examinee to agree that it is true or valid
    • contrast it with a widely accepted alternative, without requiring the examinee to agree that it is better or preferable to the alternative
    • adopt it as the One True View, but with discussion notes that reference the controversy about this belief or attitude and make clear that this item will be accepted for some exams and bounced out of others.


  • Metacognition refers to the executive process that is involved in such tasks as:
    • planning (such as choosing which procedure or cognitive strategy to adopt for a specific task)
    • estimating how long it will take (or at least, deciding to estimate and figuring out what skill / procedure / slave-labor to apply to obtain that information)
    • monitoring how well you are applying the procedure or strategy
    • remembering a definition or realizing that you don’t remember it and rooting through Google for an adequate substitute
  • Much of context-driven testing involves metacognitive questions:
    • which test technique would be most useful for exposing what information that would be of what interest to who?
    • what areas are most critical to test next, in the face of this information about risks, stakeholder priorities, available skills, available resources?
  • Questions / issues that should get you thinking about metacognition are:
    • How to think about …
    • How to learn about …
    • How to talk about …
  • In the BBST course, the section on specification analysis includes a long metacognitive digression into active reading and strategies for getting good information value from the specification fragments you encounter, search for, or create.

Assessment Objectives. Part 2: Anderson & Krathwohl’s (2001) update to Bloom’s taxonomy

Wednesday, December 6th, 2006

Bloom’s taxonomy has been a cornerstone of instructional planning for 50 years. But there have been difficult questions in how to apply it.

The Bloom commission presented 6 levels of (cognitive) knowledge:

  • Knowledge (for example, can state or identify facts or ideas)
  • Comprehension (for example, can summarize ideas, restate them in other words, compare them to other ideas)
  • Application (for example, can use the knowledge to solve problems)
  • Analysis (for example, can identify patterns, identify components and explain how they connect to each other)
  • Synthesis (for example, can relate different things to each other, combine ideas to produce an explanation)
  • Evaluation (for example, can weigh costs and benefits of two different proposals)

For example, I “know” a fact (“the world is flat”) and I can prove that I know it by saying it (“The world is flat”). But I also know a procedure (“Do these 48 steps in this order to replicate this bug”) and I can prove that I know it by, er, ah — maybe it’s easier for me to prove I know it by DOING it than by saying it. (Have you ever tried to DO “the world is flat?”) Is it the same kind of thing to apply your knowledge of a fact as your knowledge of a procedure? What about knowing a model? If knowing a fact lets me say something and knowing a procedure helps me do something, maybe knowing a model helps me predict something. Say = do = predict = know?

Similarly, think about synthesizing or evaluating these different things. Is the type and level of knowledge really the same (would we test people’s knowledge in the same way) for these different kinds of things?

Extensive discussion led to upgrades, such as Anderson & Krathwohl’s and Marzano’s.

Rather than ordering knowledge on one dimension, from easiest-to-learn to hardest, the new approaches look at different types of information (facts, procedures, etc.) as well as different levels of knowledge (remember, apply, etc.).

I find the Anderson / Krathwohl approach (simple summaries here and here and here) more intuitive and easier to apply. (YMMV, but that’s how it works for me…) Their model looks like this:

                         | The Cognitive Process Dimension
The Knowledge Dimension  | Remember | Understand | Apply | Analyze | Evaluate | Create
Factual knowledge        |          |            |       |         |          |
Conceptual knowledge     |          |            |       |         |          |
Procedural knowledge     |          |            |       |         |          |
Metacognitive knowledge  |          |            |       |         |          |

Metacognitive knowledge is knowing how to learn something. For example, much of what we know about troubleshooting and debugging and active reading is metacognitive knowledge.

  • Extending Anderson/Krathwohl for evaluation of testing knowledge
  • Assessment activities for certification in light of the Anderson/Krathwohl taxonomy

Next assessment sequence: Multiple Choice Questions: Design & Content.

Assessment Objectives. Part 1–Bloom’s Taxonomy

Friday, November 24th, 2006

This is the first of a set of articles on Assessment Objectives.

My primary objective in restarting my blog was to support the Open Certification Project for Software Testing by providing ideas on how we can go about developing it and a public space for discussion of those (and alternative) ideas.

I’m not opposed to professional certification or licensing in principle. But I do have four primary concerns:

  1. Certification involves some type of assessment (exams, interviews, demonstration of skilled work, evaluation of work products, etc.). Does the assessment measure what it purports to measure? What level of knowledge or skill does it actually evaluate? Is it a fair evaluation? Is it a reasonably accurate evaluation? Is it an evaluation of the performance of the person being assessed?
  2. Certification is often done with reference to a standard. How broadly accepted is this standard? How controversial is it? How relevant is it to competence in the field? Sometimes, a certification is more of a political effort–a way of recruiting supporters for a standard that has not achieved general acceptance. If so, is it being done with integrity–are the people studying for certification and taking the assessment properly informed?
  3. Certifiers can have many different motives. Is this part of an honest effort to improve the field? A profit-making venture to sell training and/or exams? A political effort to recruit relatively naive practitioners to a viewpoint by investing them in credential that they would otherwise be unlikely to adopt?
  4. The certifier and the people who have been certified often make claims about the meaning of the certification. A certification might be taken as significant by employers, schools, businesses or government agencies who are attempting to do business with a qualified contractor, or by other people who are considering becoming certified. Is the marketing (by the certifier or by the people who have been certified and want to market themselves using it) congruent with the underlying value of the certificate?

The Open Certification Project for Software Testing is in part a reaction to the state of certification in our field. That state is variable–the story is different for each of the certifications available. But I’m not satisfied with any of them, nor are several colleagues. The underlying goals of the Open Certification project are partially to create an alternative certification and partially to create competitive pressure on other providers, to encourage them to improve the value and/or the marketing of their certification(s).
For the next few articles, I want to consider assessment.

In 1948, a group of “college examiners” gathered at an American Psychological Association meeting and decided to try to develop a theoretical foundation for evaluating whether a person knows something, and how well. The key product of that group was Bloom’s (1956) Taxonomy (see Wikipedia, The Encyclopedia of Educational Technology, the National Teaching & Learning Forum, Don Clark’s page, Teacher Tap, or just ask Google for a wealth of useful stuff). The Bloom Committee considered how we could evaluate levels of cognitive knowledge (distinct from psychomotor and affective) and proposed six levels:

  • Knowledge (for example, can state or identify facts or ideas)
  • Comprehension (for example, can summarize ideas, restate them in other words, compare them to other ideas)
  • Application (for example, can use the knowledge to solve problems)
  • Analysis (for example, can identify patterns, identify components and explain how they connect to each other)
  • Synthesis (for example, can relate different things to each other, combine ideas to produce an explanation)
  • Evaluation (for example, can weigh costs and benefits of two different proposals)

It turns out to be stunningly difficult to assess a student’s level of knowledge. All too often, we think we are measuring one thing while we actually measure something else.

For example, suppose that I create an exam that asks students:

“What are the key similarities and differences between domain testing and scenario testing? Describe two cases, one better suited for domain analysis and the other better suited for scenario, and explain why.”

This is obviously an evaluation-level question, right? Well, maybe. But maybe not. Suppose that a student handed in a perfect answer to this question:

  • Knowledge. Maybe students saw this question in a study guide (or a previous exam), developed an answer while they studied together, then memorized it. (Maybe they published it on the Net.) This particular student has memorized an answer written by someone else.
  • Comprehension. Maybe students prepared a sample answer for this question, or saw this comparison online or in the textbook, or the teacher made this comparison in class (including the explanation of the two key examples), and this student learned the comparison just well enough to be able to restate it in her own words.
  • Application. Maybe the comparison was given in class (or in a study guide, etc.) along with the two “classic” cases (one for domain, one for scenario) but the student has had to figure out for himself why one works well for domain and the other for scenario. He has had to consider how to apply the test techniques to the situations.

These cases reflect a very common problem. How we teach, how our students study, and what resources our students study from will impact student performance–what they appear to know–on exams even if they don’t make much of a difference to the underlying competence–how well they actually know it.

The distinction between competence and performance is fundamental in educational and psychological measurement. It also cuts both ways. In the examples above, performance appears to reflect a deeper knowledge of the material. What I often see in my courses is that students who know the material well underperform (get poor grades on my exams) because they are unfamiliar with strictly-graded essay exams (see my grading videos video#1 and video#2 and slides) or with well-designed multiple choice exams. The extensive discussions of racial bias and cultural bias in standardized exams are another example of the competence/performance discussion–some groups perform less well on some exams because of details of the method of examination rather than because of a difference in underlying knowledge.

When we design an assessment for certification:

  • What level of knowledge does the assessment appear to get to?
  • Could someone who knows less or knows the material less deeply perform as well as someone who knows it at the level we are trying to evaluate?
  • Might someone who knows this material deeply perform less well than we expect (for example, because they see ambiguities that a less senior person would miss)?

In my opinion, an assessment is not well designed and should not be used for serious work, if questions like these are not carefully considered in its design.

Coming soon in this sequence — Assessment Objectives:

  • Anderson, Krathwohl et al. (2001) update the 1956 Bloom taxonomy.
  • Extending Anderson/Krathwohl for evaluation of testing knowledge
  • Assessment activities for certification in light of the Anderson/Krathwohl taxonomy

Next assessment sequence — Multiple Choice Questions: Design & Content.

Updating some core concepts in software testing

Tuesday, November 21st, 2006

Most software testing techniques were first developed in the 1970’s, when “large” programs were tiny compared to today.

Programmer productivity has grown dramatically over the years, a result of paradigmatic shifts in software development practice. Testing practice has evolved less dramatically and our productivity has grown less spectacularly. This divergence in productivity has profound implications—every year, testers impact less of the product. If we continue on this trajectory, our work will become irrelevant because its impact will be insignificant.

Over the past few years, several training organizations have created tester certifications. I don’t object in principle to certification but the Body of Knowledge (BoK) underlying a certificate has broader implications. People look to BoKs as descriptions of good (or at least current) attitudes and practice.

I’ve been dismayed by the extent to which several BoKs reiterate the 1980’s. Have we really made so little progress?

When we teach the same basics that we learned, we provide little foundation for improvement. Rather than setting up the next generation to rebel against the same dumb ideas we worked around, we should teach our best alternatives, so that new testers’ rebellions will take them beyond what we have achieved.

One popular source of orthodoxy is my own book, Testing Computer Software (TCS). In this article, I would like to highlight a few of the still-influential assumptions or assertions in TCS, in order to reject them. They are out of date. We should stop relying on them.

Where TCS was different

I wrote TCS to highlight what I saw as best practices (of the 1980’s) in Silicon Valley, which were at odds with much of the received wisdom of the time:

  • Testers must be able to test well without authoritative (complete, trustworthy) specifications. I coined the phrase, exploratory testing, to describe a survival skill.
  • Testing should address all areas of potential customer dissatisfaction, not just functional bugs. Because matters of usability, performance, localizability, supportability, and (these days) security are critical factors in the acceptability of the product, test groups should become skilled at dealing with them. Just because something is beyond your current skill set doesn’t mean it’s beyond your current scope of responsibility.
  • It is neither uncommon nor unethical to defer (choose not to fix) known bugs. However, testers should research a bug or design weakness thoroughly enough and present it carefully enough to help the project team clearly understand the potential consequences of shipping with this bug.
  • Testers are not the primary advocates of quality. We provide a quality assistance service to a broader group of stakeholders who take as much pride in their work as we do.
  • The decision to automate a test is a matter of economics, not principle. It is profitable to automate a test (including paying the maintenance costs as the program evolves) if you would run the manual test so many times that the net cost of automation is less than manual execution. Many manual tests are not worth automating because they provide information that we don’t need to collect repeatedly.
  • Testers must be able to operate effectively within any software development lifecycle—the choice of lifecycle belongs to the project manager, not the test manager. In addition, the waterfall model so often advocated by testing consultants might be a poor choice for testers because the waterfall pushes everyone to lock down decisions long before vital information is in, creating both bad decisions and resistance to later improvement.
  • Testers should design new tests throughout the project, even after feature freeze. As long as we keep learning about the product and its risks, we should be creating new tests. The issue is not whether it is fair to the project team to add new tests late in the project. The issue is whether the bugs those tests could find will impact the customer.
  • We cannot measure the thoroughness of testing by computing simple coverage metrics or by creating at least one test per requirement or specification assertion. Thoroughness of testing means thoroughness of mitigation of risk. Every different way that the program could fail creates a role for another test.
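The economic argument in the automation bullet above can be made concrete with a small break-even calculation. This is only a sketch under a deliberately simple, hypothetical cost model (a one-time build cost plus fixed per-run costs, all in the same unit); the method name, parameter names and figures are illustrative assumptions, not numbers from TCS.

```ruby
# Hypothetical cost model: every name and figure here is an illustrative assumption.
# All costs must use one consistent unit (for example, minutes of tester time).
def automation_break_even_runs(build_cost:, maintenance_per_run:, automated_run_cost:, manual_run_cost:)
  # What each automated run saves compared with running the test by hand.
  saving_per_run = manual_run_cost - (automated_run_cost + maintenance_per_run)
  return nil if saving_per_run <= 0 # automation never pays for itself

  # Smallest number of runs at which automation is strictly cheaper than manual execution.
  (build_cost / saving_per_run.to_f).floor + 1
end

runs = automation_break_even_runs(
  build_cost: 500, maintenance_per_run: 5, automated_run_cost: 1, manual_run_cost: 30
)
puts "Automation pays off after #{runs} runs"
```

If the test would be run fewer times than the break-even count, or if the method returns nil, the model says to keep the test manual. A real decision would also weigh the information value of each run, which this sketch ignores.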

The popularity of these positions (and the ones they challenged) waxes and wanes, but at least they are seen as mainstream points of view.

Where TCS 3 would (will) be different

TCS Editions 1 and 2 were written in a moderate tone. In retrospect, my wording was sometimes so gentle that readers missed key points. In addition, some of TCS 2’s firm positions were simply mistaken:

  • It is not the primary purpose of testing to find bugs. (Nor is it the primary purpose of testing to help the project manager make decisions.) Testing is an empirical investigation conducted to provide stakeholders with information about the quality of the software under test. Stakeholders have different informational needs at different times, in different situations. The primary purpose of testing is to help those stakeholders gain the information they need.
  • Testers should not attempt to specify the expected result of every test. The orthodox view is that test cases must include expected results. There are many stories of bugs missed because the tester simply didn’t recognize the failure. I’ve seen this too. However, I have also seen cases in which testers missed bugs because they were too focused on verifying “expected” results to notice a failure the test had not been designed to address. You cannot specify all the results (all the behaviors and system/software/data changes) that can arise from a test. There is value in documenting the intent of a test, including results or behaviors to look for, but it is important to do so in a way that keeps the tester thinking and scanning for other results of the test instead of viewing the testing goal as verification against what is written.
  • Procedural documentation probably offers little training value. I used to believe testers would learn the product by following test scripts or by testing user documentation keystroke by keystroke. Some people do learn this way, but others (maybe most) learn more from designing / running their own experiments than from following instructions. In science education, we talk about this in terms of the value of constructivist and inquiry-based learning. There’s an important corollary to this that I’ve learned the hard way—when you create a test script and pass it to an inexperienced tester, she might be able to follow the steps you intended, but she won’t have the observational skills or insights that you would have if you were following the script instead. Scripts might create a sequence of actions but they don’t create cognition.
  • Software testing is more like design evaluation than manufacturing quality control. A manufacturing defect appears in an individual instance of a product (like badly wired brakes in a car). It makes sense to look at every instance in the same ways (regression tests) because any one might fail in a given way, even if the one before and the one after did not. In contrast, a design defect appears in every instance of the product. The challenge of design QC is to understand the full range of implications of the design, not to look for the same problem over and over.
  • Testers should not try to design all tests for reuse as regression tests. After they’ve been run a few times, a regression suite’s tests have one thing in common: the program has passed them all. In terms of information value, they might have offered new data and insights long ago, but now they’re just a bunch of tired old tests in a convenient-to-reuse heap. Sometimes (think of build verification testing), it’s useful to have a cheap heap of reusable tests. But we need other tests that help us understand the design, assess the implications of a weakness, or explore an issue by machine that would be much harder to explore by hand. These often provide their value the first time they are run—reusability is irrelevant and should not influence the design or decision to develop these tests.
  • Exploratory testing is an approach to testing, not a test technique. In scripted testing, a probably-senior tester designs tests early in the testing process and delegates them to programmers to automate or junior testers to run by hand. In contrast, the exploratory tester continually optimizes the value of her work by treating test-related learning, test design, test execution and test result interpretation as mutually supportive activities that run in parallel throughout the project. Exploration can be manual or automated. Explorers might or might not keep detailed records of their work or create extensive artifacts (e.g. databases of sample data or failure mode lists) to improve their efficiency. The key difference between scripting and exploration is cognitive—the scripted tester follows instructions; the explorer reinvents instructions as she stretches her knowledge base and imagination.
  • The focus of system testing should shift to reflect the strengths of programmers’ tests. Many testing books (including TCS 2) treat domain testing (boundary / equivalence analysis) as the primary system testing technique. To the extent that it teaches us to do risk-optimized stratified sampling whenever we deal with a large space of tests, domain testing offers powerful guidance. But the specific technique—checking single variables and combinations at their edge values—is often handled well in unit and low-level integration tests. These are much more efficient than system tests. If the programmers are actually testing this way, then system testers should focus on other risks and other techniques. When other people are doing an honest and serious job of testing in their way, a system test group so jealous of its independence that it refuses to consider what has been done by others is bound to waste time repeating simple tests and thereby miss opportunities to try more complex tests focused on harder-to-assess risks.
  • Test groups should offer diverse, collaborating specialists. Test groups need people who understand the application under test, the technical environment in which it will run (and the associated risks), the market (and their expectations, demands, and support needs), the architecture and mechanics of tools to support the testing effort, and the underlying implementation of the code. You cannot find all this in any one person. You can build a group of strikingly different people, encourage them to collaborate and cross-train, and assign them to project areas that need what they know.
  • Testers may or may not work best in test groups. If you work in a test group, you probably get more testing training, more skilled criticism of your tests and reports, more attention to your test-related career path, and stronger moral support if you speak unwelcome truths to power. If you work in an integrated development group, you probably get more insight into the development of the product, more skilled criticism of the impact of your work, more attention to your broad technical career path, more cross-training with programmers, and less respect if you know lots about the application or its risks but little about how to write code. If you work in a marketing (customer-focused) group, you probably get more training in the application domain and in the evaluation of product acceptability and customer-oriented quality costs (such as support costs and lost sales), more attention to a management-directed career path, and more sympathy if programmers belittle you for thinking more like a customer than a programmer. Similarly, even if there is a cohesive test group, its character may depend on whether it reports to an executive focused on testing, support, marketing, programming, or something else. There is no steady-state best place for a test group. Each choice has costs and benefits. The best choice might be a fundamental reorganization every two years to diversify the perspectives of the long-term staff and the people who work with them.
  • We should abandon the idea, and the hype, of best practices. Every assertion that I’ve made here has been a reaction to another that is incompatible but has been popularly accepted. Testers provide investigative services to people who need information. Depending on the state of their project, the ways in which the product is being developed, and the types of information the people need, different practices will be more appropriate, more efficient, more conducive to good relations with others, more likely to yield the information sought—or less.

This paper has been a summary of a talk I gave at KWSQA last month and was written for publication in their newsletter. For additional details, see my paper, The Ongoing Revolution in Software Testing, available at

Final Exam for Software Testing 2 Class

Sunday, March 9th, 2003

People often ask me about the difference between commercial certification in software testing and university education. I explain that the tester certification exams typically test what people have memorized rather than what they can do.

Several people listen to this politely but don’t understand the distinction that I’m making.

I just finished teaching Software Testing 2 at Florida Tech. This is our first round with the course. Next year’s edition will be tougher. (As we work out the kinks, the course will cover more and better each time we teach it, for at least three more teachings.) Even though the course was relatively easy, I think our course’s final exam illustrates the difference between a certification exam and a university exam.

Comments welcome. Send them to me at

April 17 – 25, 2003
Due April 25, 2003. I will accept late exams, without late penalty, up to 5:00 p.m. on May 1. No exams will be accepted after 5:00 p.m. on May 1. It is essential that you work on this exam ALONE with no input or assistance from other people. You MAY NOT discuss your progress or results with other students in the class.

Use Ruby to build a high-volume automated test tool to check the mathematics in the OpenOffice spreadsheet.

Total points available = 100

1. Your development of the Ruby program should be test-driven. Use testunit (or runit) to test the Ruby program. Show several iterations in the test-driven development.


2. You will test OpenOffice 1.0 by comparing its results to results you get from Microsoft Excel.

2a. Choose five mathematical or financial functions that take one or two parameters

2b. Choose five mathematical or financial functions that take many parameters (at least 3)


3. Your program should generate random inputs that you will feed as parameter values to the functions that you have selected.
For each function, run 100 tests as follows:

* Generate the input(s) for this function. The set you use should be primarily valid, but you should try some invalid values as well.
* Determine whether a given input is a valid or invalid input and reflect this in your output
* Evaluate the function in OpenOffice
* Evaluate the function in Excel
* Compare the results
* Determine whether the results are sufficiently close
* Summarize the results, across all 100 tests of this function
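One possible shape for the loop in Part 3 is sketched below. It is not a solution to the exam: the two lambdas are hypothetical stand-ins for "evaluate in OpenOffice" and "evaluate in Excel" so that the sketch runs on its own, and the tolerances, the 90/10 split of valid to invalid inputs, and all method names are assumptions.

```ruby
# Hypothetical stand-ins for the two spreadsheets. A real tool would drive
# OpenOffice and Excel here; these lambdas only make the sketch runnable.
under_test = ->(x) { Math.sqrt(x) }
reference  = ->(x) { x**0.5 }

# Validity rule for a square-root style function (assumed for this example).
def valid_sqrt_input?(x)
  x.is_a?(Numeric) && x >= 0
end

# "Sufficiently close": within a relative tolerance, with an absolute floor
# so that results near zero can still match. Both tolerances are assumptions.
def close_enough?(a, b, rel_tol: 1e-9, abs_tol: 1e-12)
  (a - b).abs <= [abs_tol, rel_tol * [a.abs, b.abs].max].max
end

def run_comparison(under_test, reference, trials: 100, seed: 42)
  rng = Random.new(seed) # a fixed seed makes any failure reproducible
  summary = Hash.new(0)
  trials.times do
    # Mostly valid inputs, with roughly 10% invalid (negative) ones mixed in.
    x = rng.rand < 0.9 ? rng.rand(0.0..1.0e6) : -rng.rand(0.0..1.0e6)
    unless valid_sqrt_input?(x)
      # This sketch only counts invalid inputs; a real tool would also
      # compare how each spreadsheet reports the error.
      summary[:invalid_input] += 1
      next
    end
    a = under_test.call(x)
    b = reference.call(x)
    summary[close_enough?(a, b) ? :match : :mismatch] += 1
  end
  summary
end

p run_comparison(under_test, reference)
```

Seeding the generator is a design choice worth keeping in a real tool: when a mismatch turns up, the same seed replays the exact inputs that produced it.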


4. Now test formulas that combine functions from the 10 functions you have used so far.

4a. Create and test 5 interestingly complex formulas. Evaluate them with 100 tests each, as you did for functions in Part 3.


5. Now test random formulas using the same 10 functions you have used so far.

5a. For 100 test cases, randomly create a formula, and randomly generate VALID input data. From here,

* Evaluate the formula in OpenOffice
* Evaluate the formula in Excel
* Compare the results
* Determine whether the results are sufficiently close
* Summarize the results of these 100 tests
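For Part 5, one way to create a random formula is to build a small expression tree and then render it as text for each spreadsheet. The sketch below is an illustration, not the expected answer: the function table is a hypothetical choice of functions that are defined for every finite input, the Ruby lambdas are stand-ins for the spreadsheets, and the argument separator is a parameter because the two products may expect different separators depending on version and locale.

```ruby
# Hypothetical function table: spreadsheet name => [arity, Ruby stand-in].
# These functions accept any finite number, so every generated formula is valid.
FUNCTIONS = {
  "ABS" => [1, ->(x) { x.abs }],
  "SIN" => [1, ->(x) { Math.sin(x) }],
  "COS" => [1, ->(x) { Math.cos(x) }],
  "MAX" => [2, ->(x, y) { [x, y].max }],
  "MIN" => [2, ->(x, y) { [x, y].min }]
}.freeze

# Builds a random expression tree. Leaves are numeric literals; inner nodes
# are function calls whose arguments are smaller random trees.
def random_formula(rng, depth)
  return { value: rng.rand(-100.0..100.0).round(3) } if depth.zero? || rng.rand < 0.3

  name = FUNCTIONS.keys.sample(random: rng)
  arity = FUNCTIONS[name][0]
  { name: name, args: Array.new(arity) { random_formula(rng, depth - 1) } }
end

# Renders the tree as formula text using the given argument separator.
def to_text(node, separator)
  return node[:value].to_s if node.key?(:value)

  "#{node[:name]}(#{node[:args].map { |arg| to_text(arg, separator) }.join(separator)})"
end

# Evaluates the tree with the Ruby stand-ins (in place of a real spreadsheet).
def evaluate(node)
  return node[:value] if node.key?(:value)

  FUNCTIONS[node[:name]][1].call(*node[:args].map { |arg| evaluate(arg) })
end

rng = Random.new(2003)
formula = random_formula(rng, 3)
puts to_text(formula, ";")
puts evaluate(formula)
```

Keeping the formula as a tree rather than a string means the same test case can be rendered for each spreadsheet and also evaluated independently.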


6. In questions 4 and 5, you probably discovered that you could supply a function with an input value that was valid, but then the function evaluated to a value that was not valid for the function that took this as input.

For example, log(cosine(90 degrees)) is undefined. The initial input (90 degrees) is valid. Cosine evaluates to 0, which is valid, but log(0) is undefined and so cosine(90) is invalid as an input for log.

Describe a strategy that you would use to guarantee that the formula evaluates to a valid, numeric result.
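To illustrate the log(cosine(90 degrees)) example, here is a sketch of one possible strategy among several: check every intermediate value against the domain of the function that consumes it, and regenerate the input when a check fails. The domain table, the tolerance and all names are assumptions. Note that in floating-point arithmetic the cosine of 90 degrees typically comes out as a tiny nonzero number rather than exactly 0, so the sketch treats values below a small tolerance as zero.

```ruby
EPSILON = 1e-12 # assumed tolerance below which a value is treated as zero

# Hypothetical domain checks: which inputs each function accepts.
DOMAIN = {
  cos: ->(x) { x.finite? },
  log: ->(x) { x.finite? && x > EPSILON }
}.freeze

# Ruby stand-ins for the spreadsheet functions. Cosine takes degrees, as in
# the example; using base 10 for log is an assumption.
IMPL = {
  cos: ->(degrees) { Math.cos(degrees * Math::PI / 180.0) },
  log: ->(x) { Math.log10(x) }
}.freeze

# Applies the functions innermost first. Returns nil as soon as an
# intermediate value falls outside the next function's domain.
def evaluate_chain(functions, input)
  functions.reduce(input) do |value, name|
    return nil unless DOMAIN[name].call(value)

    IMPL[name].call(value)
  end
end

# Draws random inputs until the whole chain evaluates, or gives up.
def valid_input_for(functions, rng, max_attempts: 1000)
  max_attempts.times do
    candidate = rng.rand(0.0..360.0)
    return candidate unless evaluate_chain(functions, candidate).nil?
  end
  nil
end

p evaluate_chain(%i[cos log], 90.0)            # rejected: cosine is effectively 0
p valid_input_for(%i[cos log], Random.new(7))  # some angle whose cosine is positive
```

Rejecting and regenerating is simple, but it biases the sample toward inputs that are easy to satisfy; solving for the valid input range of each nested function would be a more thorough alternative.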