Updating BBST to Version 4.0

July 15th, 2016

Becky Fiedler and I are designing the next generation of BBST. We’ll soon start the implementation of BBST-Foundations 4.0. This post is the first of a series soliciting design advice for BBST.

    1. We’re looking for comments, including criticism and alternative suggestions: We are revising the core topics of BBST-Foundations. This post lays out our current thinking. Some parts are undoubtedly wrong.
    2. We’re looking for recommendation-worthy readings. We point students to reading material. Some is required reading (we test the students on it, and/or make them apply it). Other readings are listed as recommended reading or pointed to in individual slides as references. We’re happy to point to good blog posts, white papers, magazine articles and conference talks, along with the usual books and formal papers. Our criteria are simple:
      1. Is this relevant?
      2. Is it written at a level that our students can understand?
      3. Will they learn useful things from it?

      We probably can’t include all good suggestions, but we do want to point students to a broader set of references, from a broader group of people, than Foundations 3.0.

This is a long article. I split it into 5 sections (separate posts) because it is too much to read in one sitting.

Background: What is BBST

Some readers of this post won’t be familiar with BBST. Here’s a bit of background. If you are already familiar with BBST, there’s nothing new here. Skip to the next section.

BBST is short for Black Box Software Testing. “Black box” is an old engineering term. When you analyze something as a “black box”, you look at its interaction with the world. You understand it in terms of the actions it initiates and the responses it makes to inputs. The value of studying something as a black box is the focus. You don’t get distracted by the implementation. You focus on the appropriateness and value of the behavior.

Hung Quoc Nguyen and I developed the first version of BBST right after we published Testing Computer Software 2.0 in 1993. (The 1993 and 1999 versions are the same book, different publishers.) The course became commercially successful. I cotaught the course with several people. Each of them helped me expand my coverage, deepen my treatment, improve my instructional style, and update my thinking. So did the Los Altos Workshops on Software Testing, which Brian Lawrence, Drew Pritsker and I cofounded, and the spinoffs of that, especially AWTA (Pettichord’s Austin Workshops on Test Automation), STMR (my workshops on the Software Test Managers Roundtable), WTST (Fiedler’s and my Workshops on Teaching Software Testing), and the much smaller set of WHETs (Bach’s and my Workshops on Heuristic & Exploratory Techniques).

Fifteen years ago, Bach and I were actively promoting the context-driven approach to software testing together. As part of our co-branding, we started signing BBST as Kaner-and-Bach, and he let me incorporate some materials from RST (his Rapid Software Testing course) into BBST. Our actual course development paths remained largely independent. Bach worked primarily on RST, while I worked primarily on BBST.

Florida Institute of Technology recruited me to teach software engineering in 2000. In that setting, Fiedler and I wrote grant proposals and got funding from NSF to expand BBST and turn it into an online course. The “flipped classroom” (lectures on video, activities in the classroom) was invented independently and simultaneously by several instructional researchers/developers. Fiedler and I (with BBST 1.0) were among the earliest publishers within this group.

We evolved BBST to 3.0 through a collaboration with the Association for Software Testing, and with several individual colleagues (especially, I think, Scott Barber, Doug Hoffman, and John McConda, but with significant guidance from many others including James Bach, Ajay Bhagwat, Bernie Berger, Rex Black, Michael Bolton, Tim Coulter, Jack Falk, Sabrina Fay, Elisabeth Hendrickson, Alan Jorgenson, Pat McGee, Hung Nguyen, Andy Tinkham — please see the opening slides in the Foundations course. It’s a long list.) For more on the history, including parts of the key NSF grant proposal and many other links, please see Evolution of the BBST Courses.

There are 4 BBST courses: Foundations, Bug Advocacy, Test Design, and Domain Testing.

We teach skills along with content. Along with testing skills and knowledge, we teach learning skills, communication skills and computing fundamentals.

To see how we analyze the depth of instruction for a part of the course, see BBST Courses: BBST Learning Objectives. To see how we balance the coverage of learning objectives across the courses, see Learning Objectives of the BBST Courses and then see Emphasis & Objectives of the Test Design Course.

 

Updating to BBST 4.0: Differences Between the Core BBST Courses and Domain Testing


This is the second section of my post on BBST 4.0.

Differences Between the Core BBST Courses and Domain Testing

Domain Testing focuses on one testing technique (domain testing), which is believed to be the most popular test technique in the known universe. (We hope to develop some additional one-technique-focused courses, but this one had to come first.) In this course, we help students go beyond awareness and basic understanding of the technique, to develop skill with it. Because it is more in-depth, and more focused on doing the technique well than on talking about it, the course structure had to change a bit.

Our most important change was the introduction of application videos, in which we demonstrate parts of the technique on real programs. More details on this below… I mention it here because this is the primary idea from Domain Testing that will come into BBST 4.0.
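For readers unfamiliar with the technique: domain testing partitions an input’s possible values into equivalence classes and concentrates the tests at the boundaries of those classes, where bugs cluster. Here is a minimal sketch of that idea. The `bulk_discount` function and its discount rule are my invented example, not material from the course:

```python
def bulk_discount(quantity):
    """Hypothetical rule: 10% discount for orders of 100 units or more."""
    if quantity >= 100:
        return 0.10
    return 0.0

# Two equivalence classes: quantities below 100 (no discount) and
# quantities of 100 or more (discount). Domain testing samples at and
# around the boundary rather than at arbitrary interior points.
boundary_cases = {
    99: 0.0,    # just below the boundary
    100: 0.10,  # on the boundary
    101: 0.10,  # just above the boundary
}

for quantity, expected in boundary_cases.items():
    assert bulk_discount(quantity) == expected
```

The course goes far beyond this sketch, of course; the point here is only the core move of the technique: pick the few test values that are most likely to expose a boundary error.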

Structure of the Core BBST Courses

The original instructor-led BBST courses are organized around six lectures (about six hours of talk, divided into one-hour parts), with a collection of activities. To successfully complete a typical instructor-led course, a student spends about 12-15 hours per week for 4 weeks (48-60 hours total).

Most of the course time is spent on the activities:

  • Orientation activities introduce students to a key challenge considered in a lecture. The student puzzles through the activity for 30 to 90 minutes, typically before watching the lecture, then sees how the lecture approaches this type of problem. The typical Foundations course has two to four of these.
  • Application activities call for two to six hours of work. Each applies ideas or techniques presented in a lecture or developed over several lectures. The typical Foundations course has one or two of these.
  • Multiple-choice quizzes help students identify gaps in their knowledge or understanding of key concepts in the course. These questions are tough because they are designed to be instructional or diagnostic (to teach you something, to deepen your knowledge of something, or to help you recognize that you don’t understand something) rather than to fairly grade you.
  • Various other discussions help the students get to know each other better, chew on the course’s multiple-choice quiz questions, or consider other topics of current interest.
  • An essay-style final exam.
  • Interactive Grading. An interactive grading session is a one-on-one meeting, student with instructor, that lasts 1-to-2 hours (occasionally, 3 hours, ouch) and is focused on a specific exam or major assignment. In Foundations, we work through the exam. In Bug Advocacy, we focus on a student’s analysis of a bug report. In Test Design, we focus on the second course assignment, which applies a combination of risk-based testing and domain testing to OpenOffice.

Structure of Domain Testing

The Domain Testing course breaks from the BBST model in four ways:

  • Focus On One Technique: The Domain Testing course teaches one (1) technique. Our goal is to help students develop practical skill with this technique.
  • The workbook is a textbook on domain testing, not a course support book.
  • Application Videos: Along with the core lecture videos, we provide supplementary videos that show how we apply the technique to real software products. This is the primary instructional innovation in this course, and students have praised it highly. Each lesson comes with a set of application videos. Each video demonstrates the technique on one product. We currently work with 4 very different products: a financial application, a complex embedded software system (driving a high-end sewing machine), a graphic design program and a music composition program. As we teach a new aspect of domain testing, students can choose to watch how we apply it to any one (or more) of these programs. Then they apply it to the course’s software under test (currently GnuCash, which is like an open-source QuickBooks).
  • Capstone Project Instead of an Exam: Because our focus is on skill development, we skip the multiple-choice questions and the exam. Students practice the pieces of the task in the first three weeks. In the Project, they put it all together. We interactively grade the final project.

Updating to BBST 4.0: Learning Objectives and Structure of Foundations 3.0 (2010)


This is the third section of my post on BBST 4.0.

Learning Objectives and Structure of Foundations 3.0 (2010)

Most courses present content, help students develop skills, and foster student attitudes. The typical course identifies its central topic area (such as software testing or Java programming) and focuses on that, but also presents other content, skills and attitudes that are not directly on-topic but support the student who studies this area. For example, we see communication skills as important for software testers.

In Foundations 2.0 and 3.0, we decided that the central testing content should be the five most important issues in software testing. In our view, these were:

  1. Information objectives drive the testing mission and strategy (Why do we test? What are we trying to learn and how are we trying to learn it?)
  2. Oracles are heuristic (How can we know whether a program passed or failed a test?)
  3. Coverage is a multidimensional concept (How can we determine how much testing has been done?)
  4. Complete testing is impossible (Are we done yet?)
  5. Measurement is important, but hard (How much testing have we completed and how well have we done it?)

Along the way, to teach these, we have to present definitions and foundational information.

Along with the central topic, we worked on a key set of supporting objectives:

  • Help students develop learning skills that are effective in the online-course context. Most professional-development courses are easy. Even if they present difficult material, they give students feel-good activities that help them feel as though they understand the material or can do the basic task, whether they actually learned it or not. This is good marketing (the students walk away feeling successful and happy), but we don’t think it’s good education. We prefer to give students tasks that are more challenging, tasks that help them understand how well they have learned the content and how well they can apply the skills. Most of our professional students finished their formal schooling years ago, so they are out of practice with academic skills. We have to help them rebuild their skills in effective reading, taking exams, writing assignments, providing peer reviews, and coping with critical feedback from peers and instructors.
  • Foster the attitude that our assessments are helpfully hard without being risky
    To be instructionally effective, assessments (exams, quizzes, labs, projects, writing assignments, etc.) should be hard enough that students have to stretch a bit in order to do them well, but not so hard that students give up. Students come into Foundations with vastly different backgrounds. This is not like a third-year math course, in which the instructor knows that every student has two successful years of math behind them. Our hardest assessment-design goal is to create activities that are challenging for experienced professionals while still being motivating and instructionally useful for newcomers to the field.

    One of the course-design decisions that comes out of an intention to serve a very diverse group of students is that the assessments should be challenging, feedback should be honest, but pass-fail decisions should be generous. Honest feedback sometimes requires direct and detailed criticism. Our goal is to deliver the criticism in a kind way. We are trying to teach these students, not demoralize them. While this is not possible with every student, the goal is to write in a way that helps them understand what is meant, rather than making them so defensive or embarrassed that they reject it.

    • Our intention is that most students who take Foundations and try reasonably hard should “pass” the course. Most of the people who don’t pass the course should be people who abandon it, often because they had to switch their focus to some crisis at work. Someone who has to shift focus (and stop working on BBST) won’t pass the course, but we don’t want to say they “failed” it. That’s not what happened. They didn’t fail. They just didn’t finish. So rather than saying that someone “passed” a BBST course, we say they completed it. Rather than say they “failed” our course, we say that they didn’t complete it.
    • Some students stay in the course to the end but don’t do the work, or do work that is really bad, or submit work that isn’t theirs (otherwise known as “cheating.”) These students don’t complete the course.
    • Our standard, for students who make an honest effort, is this: Bug Advocacy and Test Design are a little harder than Foundations. If, at the end of the course, the instructor believes that a student has no hope of succeeding in the next class, we won’t tell them they “completed” Foundations.
    • Most students who stay in the course do get credit for completing the course even if their work is not up to the average level of the class. They get honest feedback on each piece of work, but if they make an honest effort, we want to give them a “pass.”

Bug Advocacy and Test Design emphasize other supporting objectives, such as persuasive technical communication, time management and prioritization, and organization of complex, multidimensional datasets with simplifying themes. However, for students to have any hope with these, they need the basic learning skills that we try to help them with in Foundations.

 

Updating to BBST 4.0: What Should Change


This is the fourth section of my post on BBST 4.0.

What We Think Should Change

We are satisfied with the course standards and the supporting objectives (teaching students better learning skills). We want to make it easier for students to succeed by improving how we teach the course, but our underlying criteria for success aren’t going to change.

In contrast, we are not satisfied with the central content. Here are my notes:

1. Information objectives drive the testing mission and strategy

When you test a product, you have a defining objective. What are you trying to learn about the product? We’ll call this your information objective. Your mission is to achieve your information objective. So the information objective is more about what you want to know and the mission is more about what you will do. Your strategy describes your overall plan for actually doing/achieving it.

This is where we introduce context-driven testing. For example, sometimes testers will organize their work around bug-hunting but other times they will organize around getting an exciting product into the market as quickly as possible and making it improvable as efficiently as possible. The missions differ here, and so do the strategies for achieving them.

Critique:

  • The distinction between information objective and mission is too fine-grained. It led to unhelpful discussions and questions. We’re going to merge the two concepts into “mission.”
  • The presentation of context got buried in a mass of other details. We gave students an assignment, in which we described several contexts and they had to discuss how testing would differ in them. This was too hard for too many students. We have to provide a better foundation in the lecture and reading.
  • We must present more contexts and/or characterize them more explicitly. Back when we created BBST 3.0, we treated some development approaches as ideas rather than as contexts. The world has moved on, and what was, to some degree, hypothetical or experimental, has become established. For example:
    • In BBST 3.0, when we described a testing role that would fit in an agile organization, we avoided agile-development terminology (for reasons that no longer matter). In retrospect, that decision was badly outdated when we made it. Several different contexts grew out of the Agile Manifesto. Books like Crispin & Gregory illustrate the culture of testing in those situations. In BBST 4.0, we will treat this as a normal context.
    • Similarly, a very rapid development process is common in which you ship an early version, polish aggressively based on customer feedback, and make ongoing incremental improvements rapidly. Whether we think this approach is wonderful or not, we should recognize it as a common context. A context-respecting tester who works for, or consults to, a company that develops software this way is going to have to figure out how to provide valuable testing services that fit within this model.
  • Context-driven testing isn’t for everyone. Context-driven testing requires the tester to adapt to the context—or leave. There are career implications here, but we didn’t talk much about them in BBST 3.0. The career issues were often visible in the classes, but there was no place in the course to work on them. This section (mission drives strategy) isn’t the place for them, but the course needs a place for career discussion.

Tentative decisions:

  • Continue with this as the course opener.
  • Merge information objectives and mission into one concept (mission) which drives strategy.
  • Tighten up on the set of definitions. Definitions / distinctions that aren’t directly applicable to the course’s main concepts (such as the distinction between black box testing and behavioral testing) must go away.
  • Create a separate, well-focused treatment of career paths in software testing. (This will become a supplementary video, probably presented in the lesson that addresses test automation.)

2. Oracles are heuristic

An oracle is a mechanism for determining whether a program passed or failed a test. In 1980, Elaine Weyuker pointed out that perfect oracles rarely exist and that in practice, we rely on partial oracles. Weyuker’s work wasn’t noticed by most of us in the testing community. The insight didn’t take hold until Doug Hoffman rediscovered it, presenting his Taxonomy of Test Oracles at Quality Week and then at an all-day session at the Los Altos Workshop on Software Testing (LAWST 5).

Foundations presented oracles from two conflicting perspectives: (a) Hoffman’s (and mine) and (b) James Bach and Michael Bolton’s.

  • We started with an example from Bach’s Rapid Software Testing course, then presented Bach and Bolton’s “heuristic oracles” (or “oracle heuristics”). A heuristic is a fallible but reasonable and useful decision rule. The heuristic aspect of a heuristic oracle is the idea that the behavior of software should usually (but not always) be consistent with a reasonable expectation. Bach developed a list of reasonable expectations, such as the expectation that the current version of the software will behave similarly to a previous version. This is usually correct, but sometimes it is wrong because of a design improvement. Thus it is a heuristic. After the explanation, students worked through a challenging group assignment.
  • Next, we presented Hoffman’s list of partial oracles and mentioned that these are useful supports for automated testing. We gave students some required readings, plus quiz questions and exam study guide questions but no assignment.
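To make the contrast concrete, here is a minimal sketch of a partial oracle in Hoffman’s sense (the square-root example is mine, not from the course). The test never computes the “correct” answer independently; it checks a property that any correct answer must satisfy:

```python
import math

def sqrt_partial_oracle(x, tolerance=1e-9):
    """Check one property of the result: squaring it reproduces the input.

    This catches gross errors but is blind to others. For example, a
    routine that returned -sqrt(x) would still pass this check.
    """
    result = math.sqrt(x)
    return abs(result * result - x) <= tolerance * max(1.0, x)

for x in [0.0, 1.0, 2.0, 1e-6, 1e6]:
    assert sqrt_partial_oracle(x)
```

Oracles like this are cheap to evaluate mechanically, which is why they are such useful supports for automated testing.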

Critique:

This material worked well for RST. It did not work well in BBST. I published a critique, The Oracle Problem and the Teaching of Software Testing, and a more detailed analysis in the Foundations of Software Testing workbook. Here are some of my concerns:

  • The terminology created at least as much confusion as insight. When we tested them, students repeatedly confused the meaning of the word “oracles” and the meaning of the word “heuristics.” Many students treated the words as equivalent and made statements like “all heuristics are oracles” (which is, of course, not true).
  • The terminology is redundant and uninformative. Saying “heuristic oracle” is like saying “testly test.” The descriptor (heuristic, or testly) adds no new information. The word heuristic has been severely overused in software testing. A fallible decision rule is a heuristic. A cognitive tool that won’t always achieve the desired result is a heuristic. A choice to use a technological tool that won’t always achieve the desired result is a heuristic. Every decision testers make is rooted in heuristics because all of our decisions are made under uncertainty. Every tool we use can be called a heuristic because they are all imperfect and, given the impossibility of complete testing and the impossibility of perfect oracles, every tool we will ever use will be imperfect. This isn’t unique to software testing. Long before we started talking about this in software testing, Billy V. Koen’s award-winning introductions to engineering (often taught in general engineering courses) pointed out that engineering reasoning and methods are rooted in heuristics. There is nothing more heuristic-like about oracles than there is about any other aspect of test design, or engineering in general.
  • Heuristic is a magic word. Magic words provide a name for something while relieving you of the need to think further about it. The core issue with oracles is not that they are fallible. The core issue is that they are incomplete.
    • The nature of incompleteness: An oracle will focus a test on one aspect of the software under test (or a few aspects, but just a few). Testers (and automated tests) won’t notice that the program behaves well relative to those aspects but misbehaves in other ways. Therefore, you can notice a failure if it is one that you’re looking for, but you never know whether the program actually passed a test. You merely learn that it didn’t fail the test.
    • Human observers don’t eliminate this incompleteness. If you give a test to a human observer, with a well-defined oracle, they are likely to behave like a machine. They pay attention to what the oracle steers them to and don’t notice anything else. The phenomenon of not noticing anything else has been studied formally (inattentional blindness). If you don’t steer the observer with an oracle, you can get more diversity. If multiple people do the testing, different people will probably pay attention to different things, so across a group of people you will probably see greater coverage of the variety of risks. However, each individual is a finite-capacity processor. They have limited attention spans. They can only pay attention to a few things. When people split their attention, they become less effective in each task. An individual observer can introduce variation by paying attention to different potential problems in different runs of the same test.
    • I don’t think you achieve effective oracle diversity by choosing to “explore” (another magic word, though one that I am more fond of). I think you achieve it by deliberately focusing on some things today that you know you did not focus on yesterday. We can do that systematically by intentionally adopting different oracles at different times. Thinking of it this way, oracle diversity is at least as much a matter of disciplined test design as it is a basis for exploration. (I learned this, and its importance for the design of suites of automated tests, from Doug Hoffman.)
    • No aspect of the word “heuristic” takes us into these details. The word is a distraction from the question: How can we compensate for inevitable incompleteness, (a) when we work with automated test execution and evaluation and (b) when we work with human observers?
  • There are two uses of oracles. We emphasized the one that is wrong for Foundations:
    1. The test-design oracle: An oracle is an essential part of the design of a test, or of a suite of tests. This is especially important in the design of automated tests. The oracle defines what problems the tests can see (and what problems they are blind to). Anyone who is learning how tests are designed should be learning about building suites of imperfect tests using partial oracles.
    2. The tester-expectations oracle: An oracle can be a useful component of a bug report. When you find a bug, you usually have to tell someone about it, and the telling often includes an explanation of why you think this behavior is wrong, and how important the wrongness is. For example, suppose you say, “this is wrong because it’s different from how it used to be.” You are relying on an expectation here and treating it as an oracle. The expectation is that the program’s behavior should stay the same unless the program is intentionally redesigned.

    James Bach originally developed his list of tester expectations as a catalog of patterns of explanations that testers gave when they were asked to explain their bug reports. He recharacterized them as oracle heuristics (or heuristic oracles) years later. We think these are useful patterns, and when we create Bug Advocacy 4.0, we will probably include the ideas there (but maybe without the fancy oracular heuristical vocabulary).

  • In Foundations, Bach and Bolton’s excellent presentation of tester-expectation oracles drowned out most students’ awareness of test-design oracles. That is, what we see on Foundations 3.0 exams is that many students forget, or completely ignore, the course’s treatment of partial oracles, remembering only the tester-expectation oracles.
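The tester-expectation use of an oracle, such as “the current version should behave like the previous version,” can itself be mechanized, and doing so shows where its fallibility lives. A minimal sketch (both version functions are invented stand-ins, not real course material):

```python
def previous_version(first, last):
    """Stand-in for the prior release's behavior."""
    return f"{last}, {first}"

def current_version(first, last):
    """Stand-in for the release under test."""
    return f"{last}, {first}"

def consistency_oracle(first, last):
    """Flag any behavior change from the previous version.

    A mismatch is not automatically a bug: a human still has to judge
    whether the difference is a regression or an intended redesign.
    That judgment step is what makes this oracle fallible.
    """
    return current_version(first, last) == previous_version(first, last)

assert consistency_oracle("Ada", "Lovelace")
```

Note that this check is also incomplete in the test-design sense: if both versions share the same bug, the consistency oracle is blind to it.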

Tentative decisions:

  • All of the “heuristic” discussion of oracles is going away from Foundations.
  • We will probably introduce the notion of heuristics somewhere in the BBST series. There is value in teaching people that testers’ reasoning and decisions are rooted in heuristics, and that this is normal in all applied sciences. However, we will teach it as its own topic. It creates a distraction when pasted onto oracles.
  • We will tie the idea of partial oracles much more tightly to the design of suites of automated tests. We will focus on the problem that all oracles are incomplete, then transition into the questions:
    • What are you looking for with this (test or) series of tests?
    • What are these tests blind to?
    • How will you look for those other potential problems?
    • Can you detect these potential problems with code?
    • If not, can you train people to look for them?
    • Which problems are low-priority enough, or low-likelihood enough, that you should choose to run general exploratory tests and rely on luck and hope to help you stumble across them if they are there?
  • Years ago, many test tool vendors (and many consultants) promised complete test automation from tools that were laughably inadequate for that purpose. These people promised “complete testing” that would replace all (or almost all) of your expensive manual testers. Several of us (including me) argued that these tools were weaker than promised. We pointed out that the tools provided a very incomplete look at the behavior of the software, and even that often came with an excessively-high maintenance cost. These discussions were often adversarial and unpleasant. Sometimes very unpleasant. Arguments like these leave mental scars. I think these old scars are at the root of some of the test-automation skepticism that we hear from some of our leading old-timers (and their students). I think BBST can present a much more constructive view of current automation-support technology by starting from the perspective of partial oracles.
  • Knott’s book on Mobile Application Testing provides an excellent overview of the variety of test-automation technologies, contrasting them in terms of the basis they use for evaluating the software (e.g. comparison to a captured screen) and the costs and benefits of that basis. These are all partial oracles. They come with their own costs, including tool cost, implementation difficulty, and maintenance cost, and they come with their own limitations (what you can notice with this type of oracle and what you are blind to). I think this presentation of Knott’s, combined with updated material from Hoffman, will probably become the core of the new oracle discussion.
  • I think this is where we’ll treat automation issues more generally, discussing Cohn’s test automation pyramid (see Fowler too) (the value of extensive unit testing and below-the-UI integration testing) and Knott’s inverted pyramid. If I spend the time on this that I expect to spend, then along with describing the pyramid and the inverted pyramid and summarizing the rationales, I expect to raise the following issues:
    • For traditional applications (situations in which we expect the pyramid to apply), I think the pyramid model substantially underestimates the need for end-to-end testing. For example:
      • Performance testing is critical for many types of applications. To do this well, you need a lot of automated, end-to-end tests that model the behavior patterns of different categories of users.
      • Some bugs are almost impossible to discover with unit-level or service-level automated tests or with manual end-to-end tests or with many types of automated end-to-end tests. In my experience, wild pointers, stack overflows, race conditions and other timing problems seem to be most effectively found by high-volume automated system-level tests.
      • Security testing seems to require a combination of skilled manual attacks and routinized tests for standard vulnerabilities and high-volume automated probes.
      • As we get better at these three types of automated system-level tests, I suspect that we will develop better technology for automated functional tests.
    • For mobile apps, Knott makes a persuasive case that the pyramid has to be inverted (more system-level testing). This is because of an explosion of configuration diversity (significant variations of hardware and system software across phones) and because some tasks are location-dependent, time-dependent, or connection-dependent.
      • I think there is some value in remembering that problems like this happened in the Windows and Unix worlds too. In DOS/Windows/Apple, we had to test application compatibility with printers printer-by-printer in each application. Same for compatibility with video cards, keyboards, rodents, network interfaces, sound cards, fonts, disk formats, disk drive interfaces, etc. I’ve heard of similar problems in Unix, driven more by variation in system hardware and architecture than by peripherals. Gradually, in the Windows/Apple worlds, the operating systems expanded their reach and presented standardized interfaces to the applications. Claims of successful standardization were often highly exaggerated at first, but gradually, variances between printer models (etc.) have become much simpler, smaller testing issues. We should look at the weakly-managed device variation in mobile phones as a big testing problem today, one that we will have to develop testing strategies to deal with, while recognizing that those strategies will become obsolete as the operating systems mature. I think the historical perspective is important because of the risk that the mobile testers of today will become the test managers of tomorrow. There is a risk that they will manage for the problems they matured with, even though those problems have largely been solved by better technology and better O/S design. I think I see that today, with managers of my generation, some of whom still seem worried about the problems of the 1980’s/1990’s. Today’s students need historical perspective so that they can anticipate the evolution of system risks and plan to evolve with them.
      • How does this apply to the inverted pyramid? I think that it is inverted today as a matter of necessity, but that 20 years from now, the same automation pyramid that applies to other applications will apply equally well to mobile. As the systems mature, the distortion of testing processes that we need to cope with immaturity will gradually fall away.
  • Rather than arguing about whether automation is desirable (it is) and whether it is often extremely useful (it is) and whether it is overhyped (it is) and whether simplistic strategies that won’t work are still being peddled to people not sophisticated enough to realize what trouble they’re getting into (they are), I want people to come out of Foundations with a few questions. Here’s my current list. I’m sure it will change:
    • What types of information will we learn when we use this tool in this way?
    • What types of information are we missing and how much do we care about which types?
    • What other tools (or manual methods) can we use to fill in the knowledge gaps that we’ve identified?
    • What are the costs of using this technology?
    • Every added type of information comes at a cost. Are we willing to invest in a systematic approach to discovering a type of information about the software and if so, what will the cost be of that systematic approach? If the systematic approach is not feasible (too hard or too expensive), what manual methods for discovering the information are available, how effective are they, how expensive are they and how thorough a look are we willing to pay for?

Please send us good pointers to articles / blog posts that Foundations students can read to learn more about these questions (and ways to answer them). The course itself will go over this lightly. To address this in more depth would require a whole course, rather than a one-hour lecture. Students who are interested in this material (as every wise student will be) will learn most of what they learn on the topic from the readings.

A supplementary video:

The emphasis on automation raises a different cluster of issues having to do with the student’s career path. There used to be a solid career for a person who knew basic black box testing techniques and had general business knowledge. This person tested at the black box level, doing and designing exploratory and/or scripted tests. I think this will gradually die as a career. We have to note the resurgence of pieceworking, a type of employment where the worker is paid by the completed task (the “piece”) rather than by the time spent doing the task. If you can divide a task into many small pieces of comparable difficulty, you can spread them out across many workers. In general, I think pieceworking provides low incomes for people who are doing tasks that don’t require rare skills. Take a look at Soylent, “an add-in to Microsoft Word that uses crowd contributions to perform interactive document shortening, proofreading, and human-language macros.” Look over their site and read their paper. Think forward ten years. How much basic, system-level testing will be split out as piecework rather than done in-house or outsourced?

I’m not advocating for this change. I’m looking at a social context (that’s what context-driven people do) and saying that it looks like this is coming down the road, whether we like it or not.

There is already a big pay differential between traditional black box testers and testers who write code as part of their job. At Florida Tech, I see a big gap in starting salaries of our students. Students who go into traditional black box roles get much lower offers. As pieceworking gets more popular, I think the gap will grow wider. I think that students who are training today for a job as a traditional black-box tester are training for a type of job that will vanish or will stop paying a middle-class wage for North American workers. I think that introductory courses in software testing have a responsibility to caution students that they need to expand their skills if they want a satisfactory career in testing.

Some people will challenge my analysis. Certainly, I could be wrong. As Rob Lambert pointed out recently, people have been talking about the demise of generalist, black-box testers for years and years and they haven’t gone away yet. And certainly, for years to come, I believe there will be room for a few experts who thrive as consultants and a few senior testers who focus strictly on testing but are high-skill designers/planners. Those roles will exist, but how many?

Given my analysis (which could be wrong), I think it is the responsibility of teachers of introductory testing courses to caution students about the risk of working in traditional testing and the need to develop additional skills that, combined with an interest in testing, will market well.

  • Combining testing and programming skill is the obvious path and the one that probably opens the broadest set of doors.
  • Another path combines testing with deep business knowledge. If you know a lot about actuarial math and about the culture of actuaries, you might be able to provide very valuable services to companies that develop software for actuaries. However, your value for non-actuarial software might be limited.
  • For some types of products or services, you might add unusual value if you have expertise in human factors, or in accessibility, or in statistical process control, or in physics. Again, you offer unusual value for companies that need your combination of skills but perhaps only generic value for companies that don’t need your specific skills.
  • I don’t think that a quick, informed-amateur-level survey of popular topics in cognitive psychology will provide the kind of skills or the depth of skills that I have in mind.

It’s up to the student to figure out their career path, but it’s up to us to tell them that a career that has been solid for 50 years is changing character in ways they have to prepare for.

I’ve been trying to figure out how to focus a Lesson on this, or how to insert this into one of the other Lessons. I don’t think it will fit. There aren’t enough available hours in a 4-week online course. However, just as we gave application videos to students in the Domain Testing course (and some students watched only one video per Lesson while others watched several), I think I can create a watch-this-if-you-want-to video on career path that accompanies the lecture that addresses test automation.

3. Coverage is a multidimensional concept

My primary goal in teaching “coverage” was to open students’ eyes to the many different ways that they can measure coverage. Yes, there are structural coverage measures (statement coverage, branch coverage, subpath coverage, multicondition coverage, etc.) but there are also many black-box measures. For example, if you have a requirements specification, what percentage of the requirements items have you tested the program against?
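To make the structural measures concrete, here is a minimal sketch. The function and its tests are invented for illustration; the point is only that statement coverage and branch coverage have different denominators, so the same test suite can score differently on each:

```python
def classify(x):
    # Toy function with one if-statement, hence two branches.
    result = "non-negative"
    if x < 0:
        result = "negative"
    return result

# The single test classify(-1) executes every statement in the function,
# so it achieves 100% statement coverage. But the path where the
# condition is False is never taken, so branch coverage is only 50%.
# A second test, classify(1), is needed to cover both branches.
assert classify(-1) == "negative"
assert classify(1) == "non-negative"
```

Tools such as coverage.py can report both measures (branch coverage with its `--branch` option); the sketch above shows why they can disagree.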

Coverage is a measure, normally expressed as a percentage. There is a set of things that you could test: all the lines of code, or all the relevant flavors of smartphones’ operating systems, or all the visible features, or all the published claims about the program, etc. Pick your set. This is your population of possible test targets of a certain kind (target: thing you can test). The size of your set (the number of possible test targets) is your denominator. Count how many you’ve actually tested: that’s your numerator. Do the division and you get the proportion of possible tests of this kind that you have actually run. That’s your coverage measure. Statement coverage is the percentage of statements you’ve tested. Branch coverage is the percentage of branches you’ve tested. Visible-feature coverage is the percentage of visible features that you’ve tested thoroughly enough to say, “I’ve tested that one.”
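The arithmetic in the paragraph above can be sketched in a few lines. The feature names are hypothetical, invented for illustration; every coverage measure reduces to the same division:

```python
def coverage_percent(tested, population):
    # Coverage = (targets actually tested) / (population of possible
    # targets of this kind), expressed as a percentage.
    return 100.0 * len(set(tested) & set(population)) / len(set(population))

# Hypothetical population: 8 visible features, of which 6 have been tested.
features = {"open", "save", "print", "undo", "search", "export", "zoom", "help"}
tested = {"open", "save", "print", "undo", "search", "export"}
print(coverage_percent(tested, features))  # 75.0
```

Swap in statements, branches, configurations, or published claims as the population and the same division gives statement coverage, branch coverage, and so on.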

Complete coverage is 100% coverage of all of the possible tests. The program is completely tested if there cannot be any undiscovered bugs (because you’ve run all the possible tests). For nontrivial programs, complete testing is impossible because the population of possible tests is infinite. So you can’t have complete coverage. You can only have partial coverage. And at that point, we teach you to subdivide the world into types of test targets and to think in terms of many test coverages. My main paper on testing coverage lists 101 different types of coverage. There are many more. Maybe you want to achieve 95% statement coverage and 2% mouse coverage (how many different rodents do you need to test?) and 100% visible feature coverage and so on. Evaluating the relative significance of the different types of coverage gives you a way to organize and prioritize your testing.

I had generally positive results teaching this in Foundations 1.0 and 2.0, but it was clear that some students didn’t understand programming concepts well enough to understand what the structural coverage measures actually were or how they might go about collecting the data. Around 2008, I realized that many of my university students (computer science majors) had this problem too. They had studied control structures, of course, but not in a way that they could easily apply to coverage measurement. In addition, they (and my non-programmer practitioner students) had repeatedly shown confusion about how computers represent data, how rounding error comes about in floating point calculations, why rounding error is inherent in floating point calculations, etc. These confusions had several impacts. For example, some students couldn’t fathom overflows (including buffer overflows). Many students insisted that normal floating-point rounding errors were bugs. I had seen mistakes like these in real-life bug reports that damaged the credibility of the bug reporter. So, we decided that for Foundations 3.0, we would add basic information about programming and data representation to the course.
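The rounding behavior that confused those students is ordinary binary floating point at work. A minimal Python illustration:

```python
import math

# 0.1 has no exact binary representation, so each addition rounds a little.
total = sum([0.1] * 10)
print(total == 1.0)  # False: the accumulated value is not exactly 1.0
print(total)

# This is normal floating-point behavior, not a bug. Comparisons
# should allow a tolerance:
print(math.isclose(total, 1.0))  # True
```

A bug report that flags this as a defect says more about the reporter than about the program.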

Critique:

  • The programming material was very helpful for some students. For university students, it built a bridge between their programming knowledge and their testing lessons. Confusions that I had seen in Foundations 2.0 simply went away. For non-programmer professional students, the results were more mixed. Some students gained a new dimension of knowledge. Other students learned the material well enough to be tested on it but considered it irrelevant and probably never used it. And some students learned very little from this material, whined relentlessly and were irate that they had to learn about code in a course on black box testing.
  • The programming material overshadowed other material in Lessons 4 and 5. Some students paid so much attention to this material that they learned it adequately, but missed just about everything else. Some students dropped the course out of fear of the material or because they couldn’t find enough time to work through this material at their pace.
  • Many students never understood that coverage is a measurement. Instead, they would insist that “statement coverage” means all statements have been tested, and generally, that X coverage means that all X’s have been tested. I think we could have done a much better job with this if we had more time.
  • The lecture also addressed the risk of driving testing to achieve high coverage of a specific kind. I taught Marick’s How to Misuse Code Coverage, which describes the measurement dysfunction that he saw in organizations that pushed their testing to achieve very high statement/branch coverage. The students handled this material well on exams. The programming instruction probably helped with this.

Tentative decisions:

  • Return to a lecture that emphasizes the measurement aspects of coverage, the multidimensional nature of coverage, and the risks of driving testing to achieve high coverage (rather than high information).
  • Reduce the programming instruction to an absolute minimum. We need to teach a little bit about control structures in order to explain structural coverage, but we’ll do it briefly. This will create the same problems for students as we had in Foundations 1.0 and 2.0.
  • Create a series of supplementary videos that describe data representation and control structures. I’ll probably use the same lecture material in my Introduction to Programming in Java (Computer Science 1001). We’ll encourage students to watch the videos, probably offer a quiz on the material to help them learn it, maybe offer a bonus question on the exam, but won’t require the students to look at it.

4. Complete testing is impossible

This lecture teaches students that it is impossible to completely test a program. In the course of this instruction, we teach some basic combinatorics (showing students how to calculate the number of possible tests of certain kinds) and expose them again, in a new way, to the wide variety of types of test targets (things you can test) and the difficulty of testing all of them. We teach the basics of data flows, the basics of path testing, and provide a real-life example of the kinds of bugs you can miss if you don’t do extensive path testing.
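The flavor of that combinatorics can be sketched in a few lines (the million-tests-per-second rate is an assumption for illustration, and this counts only valid inputs of one kind):

```python
# Number of distinct values for one 32-bit integer input:
one_field = 2 ** 32                 # 4,294,967,296

# A function taking two such inputs has this many input combinations:
pairs = one_field ** 2              # 2 ** 64

# Even at a million tests per second, exhaustive testing of just the
# valid input pairs is out of reach:
years = pairs / 1_000_000 / (60 * 60 * 24 * 365)
print(f"about {years:,.0f} years")  # roughly 585,000 years
```

Add invalid inputs, sequences of operations, and configuration variations, and the population of possible tests becomes effectively infinite.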

Critique:

  • Overall, I think this material was OK in Foundations 3.0.
  • The lecture introduces a strong example of a real-life application of high-volume automated testing but provides only the briefest introduction to the concept of high-volume automated testing. The study guide (list of potential exam questions) included questions on high-volume automated testing but we were reluctant to ask any of them because we treated the material too lightly.

Tentative decisions:

  • Cover essentially the same material.
  • Integrate more tightly with the preceding material on the many types of coverage.
  • Add a little more on high-volume automated testing.

5. Measurement is important, but difficult

The lecture introduces students to the basics of measurement theory, to the critical importance of construct validity, to the prevalent use of surrogate measures and the difficulties associated with this, and to measurement dysfunction and distortion. We illustrate measurement dysfunction with bug-count metrics.

Critique:

  • Overall, the section was reasonably effective in conveying what it was designed to convey, but that design was flawed.
  • The lecture exuded mistrust of bad metrics. It provided no positive inspiration for doing good measurement and very little constructive guidance. Back in 2001, Pettichord and Bach and I wrote, “Metrics that are not valid are dangerous.” We inscribed the words in Lessons Learned in Software Testing and in the statement of the Context-Driven Testing Principles.
    • Back in the 1990’s (and still in 2001), we didn’t have much respect for the metrics work that had been done and we thought that bright people could easily make a lot of progress toward developing valid, useful metrics. We were wrong. The next dozen years taught me some humility. The context-driven testing community (and the broader software engineering community) made remarkably little progress on improving software-related measurement. Some of us (including me) have made some unsuccessful efforts. I don’t see any hint of a breakthrough on the horizon. Rather, I see more consultants who simply discourage clients and students from taking measurements or who present oversimplified descriptions of qualitative measurement that don’t convey the difficulties and the cost of doing qualitative measurement well. This stuff plays well at conferences but it’s not for me.
      • I am against bad measurement (and a lot of terrible measurement is being advocated in our field, and students need some help recognizing it). But I am not against measurement. Back in the 1990s and early 2000s, I discouraged people from taking measurements, because I thought a better set of metrics was around the corner. It’s not. Eventually, my very wise friend, Hung Quoc Nguyen counseled me that it was time for me to rethink my position. He argued that the need for data to guide management is still with us — that if we don’t have better measures available, we need to find better ways to work with what we’ve got. I have come to agree with his analysis.
      • Many people in our field are required to provide metrics to their management. We can help them improve what they provide. We can help them recognize problems and suggest better alternatives. But we aren’t helping them if we say that providing metrics is unethical and they need to refuse to do it, even if that costs them their jobs. I wrote about this in Contexts differ: Recognizing the difference between wrong and Wrong and then in Metrics, Ethics, & Context-Driven Testing (Part 2).
      • Becky Fiedler and I have used qualitative approaches several times. We’re enthusiastic about them. See our paper on Putting the Context in Context-Driven Testing, for example. But we have no interest in misrepresenting the time and skill required for this kind of work or the limitations of it. This is a useful set of tools, but it is not a collection of silver bullets.

Tentative decisions:

  • We have to fundamentally rework this material. I will probably start from two places:
  • I think these provide a better starting point for a presentation on metrics that:
    • Introduces basic measurement theory
    • Alerts people to the risks associated with surrogate measures, weak validity, and abusive use of metrics
    • Explains measurement distortion and dysfunction
    • Introduces basic qualitative measurement concepts
    • Lays out the difficulties of doing good measurement, peeks at the state of metrics in some other fields (not better than ours), and presents a more positive view on ways to gain useful information from imperfect statistics.
  • The big challenge here, the enormous challenge, will be fitting this much content into a reasonably short lecture.

Updating to BBST 4.0: Financial Model & Concluding Thoughts

July 15th, 2016

This is the 5th section of my post on BBST 4.0. The other parts are at:

Financial Model

BBST 3.0 was developed with support from the National Science Foundation. That funding supported many student assistants and the logistics of the Workshops on Teaching Software Testing. It also paid for a portion (substantial, but well under half) of the time Becky and I spent on BBST development and the equipment and software we used to create it. We also received financial support from the Association for Software Testing and from Florida Institute of Technology. We all understood and agreed that with this level of public support, BBST should be open source.

We also had a vision of BBST becoming self-supporting, like Linux. Some of our thoughts are published here and here and here. That was an ambitious idea. I pursue many ambitious ideas. Some become reality, others don’t. This one didn’t.

BBST 3.0 will continue to be available under a Creative Commons License. We expect some organizations will continue teaching it. We think it’s still a good course, and we’re proud of it. To increase the lifespan for BBST 3.0, we spent an enormous amount of time over the past three years creating the BBST Workbooks. These update the assessment materials and some of the content significantly. The time we spent on BBST 3.0 Workbooks was opportunity cost for BBST 4.0. With limited time, we couldn’t start BBST 4.0 until now (now that the Test Design Workbook has gone to press).

For the future, BBST has to become a traditional commercial course. We just don’t see a way to create it or teach it without charging for it. Becky and I created Kaner, Fiedler & Associates, LLC to take over the maintenance and evolution of BBST. This is a business. It needs income, which we can then spend on instructors and on BBST.

We would very much appreciate your guidance for BBST 4.0, but please understand that the help you give us is going to a commercial project, not to an open source one.

Concluding Thoughts

The most difficult two challenges in instructional design are:

  • Assessment: Figuring out what the students understand, using activities (including exams) that help students learn more, and learn more deeply, during the assessment process
  • Focus: Limiting the scope of the course so that you can (a) fit the material in without cramming it in and (b) teach the material deeply enough that the students can get practical value from it.

Moving the programming content into supplementary instruction makes room for more content. The new content that we think is most important is a sympathetic survey of test automation, but with a frank presentation of the limitations of the common oracles. (We’re going to rely heavily on a great survey by Knott.) We can’t fit in everything that we want to include. Intimately connected to automation, in our view, is discussion of career path. We can’t fit both into the course; there’s just not enough time. Our tentative decision is to make the career path segment supplementary — we’re pretty confident that people who want this will get to it and will get what value they can from it. The automation material will be harder, which means we should support it with assessment.

We’d love to hear your thoughts on this. BBST Foundations / Bug Advocacy / Test Design took almost 6000 hours of our development time (not counting the time spent on the workbooks). Foundations 4.0 will take a lot of effort. We’d like to produce something that creates value for the profession. Your guidance will make that more likely.

An Update to BBST

June 27th, 2016

Rebecca Fiedler and I have just completed a major round of updates to BBST, the Black Box Software Testing course. This creates what we consider a stable release, which we expect to be the final release of BBST Version 3. I am starting to work on Version 4 (initial discussion post: next week).

The course videos are unchanged. The home page for BBST (with the best course descriptions) is still http://bbst.info. Foundations, Bug Advocacy, and Test Design are still available for free at http://testingeducation.org/BBST along with course slides and other supporting materials.

These are the videos that the Association for Software Testing helped develop, and teaches from. I will probably continue to use these in the Software Engineering degree programs at Florida Tech for a few years. For now, we teach from these videos in private courses for commercial clients, through Kaner Fiedler Associates and in conjunction with Altom, which offers the latest updates to our courses to the public.

My course version identifier will stay at Version 3. (Foundations (2010) offers the third generation of Foundations videos, Bug Advocacy (2008) videos are second generation, Test Design (2011) are mixed, second and third. Collectively: Version 3).

Here’s what’s new:

(1) We published course workbooks for each of the three courses. The workbooks provide:

  • An edited transcript of each lecture
  • Updated (or completely replaced) activities
  • Detailed feedback for most activities, based on patterns in student work that we’ve seen over the 12 years since the first BBST course videos.
  • Often-detailed commentary on the lectures and activities. Our primary focus in these notes is the instructional effectiveness of the materials: treating the intended content and skill development as given, how well does the courseware support them? Occasionally, we felt compelled to acknowledge some other issues in the field that we decided were unignorably relevant to the courses.

The workbooks are:

I am making the final publishing-support revisions for Test Design this week. It should be available at Amazon before the end of July.

These books run 230-to-400 pages long. They took a lot of work.

(2) We have overhauled the multiple-choice review questions in all three courses.

These were a source of some irritation in BBST. The new questions are still hard, but student feedback is that they are much better.

(3) We revised our model for instructor-student feedback.

When we planned Version 3, one of our key concerns was that instructors were volunteers. There is only so much work you can give to volunteers before they burn out. We designed the course-interaction model to have as much peer review as possible. This would give students feedback on the quality of their work while keeping the instructor workload under control. Overall, that model has worked pretty well.

  • Feedback quality is, of course, variable. The feedback comes from people who don’t necessarily understand the material themselves. They are learning as much from giving the feedback as the students who receive it.
  • We had some problems with bullying from some students who seemed more motivated to insist on their doctrinaire views than to learn anything new. This was a nuisance to manage when it arose, but I think most instructors figured out how to handle it well enough.
  • Students who needed additional personal feedback often received private feedback via instructor emails or Skypes. My role as an instructional coach for AST ended a few years ago, but my sense is that AST’s instructors are pretty attentive to this need.

The same concerns for volunteer-taught courses apply to large-enrollment courses in universities and colleges. Detailed personalized feedback is just not an instructional option.

However, students’ expectations are appropriately different when they attend commercial versions of the course (paying the usual rates for commercial instruction) or for small-enrollment university courses. They expect better feedback.

Starting in about 2011, Professor Keith Gallagher helped me adapt his ideas on interactive grading to my courses. An interactive grading session is a one-on-one meeting, student with instructor, that lasts 1-to-2 hours (occasionally, 3 hours, ouch) and is focused on a specific exam or major assignment. Usually the focus is on the course’s key piece of work. The intent of the session is coaching, identifying ways the student can improve their work and praising the strengths. Some sessions, of course, are more difficult. In the commercial BBST courses,

  • We offer one interactive grading session per student. There are usually two instructors (sometimes three) in the commercial courses. The student typically picks which instructor to meet with.
  • We draw student attention to the feedback available on assignments in the course Workbook. For assignments that are not well-covered in the Workbook, the instructor writes a detailed class-wide feedback. We provide personalized feedback on aspects of the student’s work that are not covered in the general feedback.
  • The instructor provides summary grading for each piece of work (exceptional, acceptable, disappointing, not-submitted or no-value). (Most work is graded “acceptable.”)

I don’t think this level of feedback is sustainable in a volunteer-taught course. It is more easily sustainable if the instructor uses the Workbook as a course text and can rely on the students having their own copy. But it is still more work than I would ever want to assign to volunteers.

If you are teaching or taking BBST, we suggest that you use the Workbooks

The Workbooks are not Creative Commons licensed. They are not free, but they are inexpensive. When we originally planned these, we intended to publish through commercial publishers, but the prices they planned for the Workbooks were way too high. We created Context-Driven Press so that we could control the quality and price of the books. You can get the books from Amazon for less than $25 each (much less, for Kindle editions).

If you are teaching a BBST course, we strongly recommend that you use the course Workbook as a course text. It changes the student experience, and the feedback we continually get is that the change is strongly positive.

I think these updates mark the completion of this version of BBST.

Online BBST started in 2004. It reflects my understanding of the industry circa 2001 (when we published Lessons Learned in Software Testing). BBST Version 3 presents the same ideas. The polish from Versions 1 through 3 came more in the instructional design than in the perspective. I am proud of Version 3, and I think it will provide a useful foundation for several years to come. However, our field’s methods, technologies, and social structures have evolved. I think it is time for some bigger changes in what I teach.

There is a fourth course in the BBST series, BBST-Domain Testing. This is currently available only through Kaner Fiedler Associates, though I suspect Altom will offer it soon. The Domain Testing course includes the core sequence of videos, but also includes a set of supplementary videos that have demonstrations and supporting discussions. We spent more time on the supplementary videos than on the core sequence. In BBST Version 4, we intend to exploit some of the opportunities that supplementary videos provide.

I’ll be posting some notes on (Rebecca’s and my ideas for) Foundations 4.0, and asking for your help in planning the update. Those will come in about a week.

Schools of Software Testing: A Debate with Rex Black

August 24th, 2015

Last year, Rex and I had a debate at STPCon on the legitimacy and value of the concept of “schools of software testing.” We recently obtained a recording of the debate and merged it with slides to create a video. You can find additional slides and notes here: Kaner’s STPCon Debate Slides.

Rex Black expressed a few gripes about the concept of “schools” and about the way this concept has been applied in our field.

Schools or Strategies?

Rex’s first point — I think the central point of his presentation — is to acknowledge that there are different approaches to software testing but to argue that we should think of these as differences in strategy rather than divisions into schools. In his view (as I understand it), different people have different preferred strategies for dealing with testing situations. Some people can shift among strategies, choosing the best one for a given situation. As Rex sees it, looking at our field as a collection of schools is divisive and counterproductive.

I think his idea of conflicting (or alternative) strategies is appealing — plausible, but incomplete in a fundamental way.

The problem is that it ignores the social dynamics of the field, which is exactly what we are trying to capture with the idea of “schools.”

People tend to cluster. They find other people whose views or whose personal styles are compatible with theirs. They learn more from people in their cluster, they pay more attention to them, listen more closely to their advice and criticisms. Sometimes people cluster around an intellectually coherent point of view and organize their thinking about their work in terms of that view. At that point, we have the beginnings of a school. It is not just a strategy; it is an approach that is supported by a strong peer group.

This basic kind of clustering is so common that we barely notice it. Sometimes it becomes more pronounced and several of the clusters become more broadly influential.

Fields tend to swing between extremes of high (apparent) cohesiveness and high fragmentation. The evolution takes time.

  • At the high-apparent-cohesiveness extreme, everyone agrees (or pretends to agree) with a dominant view. There is not much controversy. Progress is incremental and not very creative.
  • At the high-fragmentation extreme, people have stopped listening to each other. They squander their creativity on better ways to promote an approach that they see as The One True Way, and to insult or shout down anyone who disagrees with it. There isn’t much progress at this extreme either. People are too busy scoring points about the basics of the field (or the basics of their controversy).

Several authors identify these extremes and describe them as unproductive. Neither extreme promotes an attitude of paying constructive attention to other views and gaining insights from them or taking risks to develop a new approach. (See my slides and notes for references.)

Between the extremes, you have creative tension and a lot of research (or skill development) that tries to get to the factual questions: what works, what happens, what costs, what benefits, what else can be done?

The idea that fields often organize themselves into schools is not controversial. It’s not something special to software testing. You see it in education, business, psychology, physics (etc., etc.)

It’s also common for the members of the dominant school to see themselves as the entire field. They often see other groups that try to differentiate themselves from the mainstream as self-promoting spinoffs, as advocates for minor variations on core views that “everyone” shares. One of the reasons that people will intentionally form and announce a school is to create a rallying point. They want a place where like-minded people can share views without being drowned out by the dominating majority, a platform for publishing and refining their set of ideas.

These rallying points are even more important when the field is engaging in political work. When I say “political,” I mean anything involving power and control. For example, standards committees are political. My experience with the IEEE software engineering standards committees is that I can become a member but my views will have no impact on the standard. The way that people who hold minority views gain more impact is to organize, so that many people together intentionally say the same things. This has an impact. For example, agile approaches (I think of contextual thinking and exploratory testing/development as agile approaches) are much more acceptable than they were 20 years ago. That is largely because of advocacy by many people, speaking together. Political work requires political action.

You can hear more about social dynamics in the debate.

Unfortunate Misbehavior

Rex’s central argument is that the characterization of the field in terms of conflicting schools is inaccurate and would be better replaced by a description of alternative strategies. Along with this argument comes a complaint, which I see as the emotional charge behind his argument. He complains about harsh statements from some people who call themselves Context-Driven and call themselves leaders of the Context-Driven School. I think he’s well-justified in feeling that some people are behaving badly and that they have treated him badly.

If you see yourself as a member of the Context-Driven School, let me suggest that as individuals, we get to choose how far we go down the path of divisiveness:

  • We can choose to compare a school of thought to a religion, but we don’t have to say that.
  • We can choose to say that anyone who isn’t a proponent of the school can’t understand what we have to say, but we don’t have to say that.
  • We can choose to say that everyone belongs to a school (even the people who insist they do not), but we don’t have to say that.

Statements like these are not factual and, to the best of my knowledge, they are not rooted in facts. They reflect choices about how people with differing views should interact.

I think some of the people who say things like this would market themselves more honestly (and, in my view, tarnish the Context-Driven Testing brand less) if they would identify themselves as the Rapid Software Testing (TM) school. I would disagree with their approach and their tone, but I wouldn’t feel obliged to assert that such views are not context-driven (see, for example, my posts Censure People for Disagreeing with Us?, Context-Driven Testing is Not a Religion, and Contexts Differ: Recognizing the Difference between Wrong and WRONG).

More details of my responses to Rex’s complaints are in the debate itself and in the notes I prepared before the debate at Kaner’s STPCon Debate Slides.

— Cem Kaner

Racial Profiling in Ferguson Missouri? A Note on Statistical Interpretation

August 21st, 2014

As I write this, there is serious community tension in Ferguson, Missouri over the shooting and killing of an unarmed black teenager by a white police officer, and over the response to that shooting by the local police force.

The narrative in much of the press is that this is yet another incident that illustrates a serious problem of racism in the United States, especially in our police. For example, many newspapers cite data from the Missouri 2013 Vehicle Stops Report. The overall report covers data in every county in Missouri and presents years of historical comparisons. The report specifically for Ferguson is here.

Here are some examples of press coverage that seem representative of what I’ve seen in many papers online:

“Last year, 86 percent of the cars stopped by Ferguson police officers were being driven by African-Americans, according to the state’s annual racial profiling report. Once pulled over in Ferguson, African-American drivers were twice as likely to be searched, according to the report.”

http://www.mcclatchydc.com/2014/08/19/237001_feds-could-go-several-ways-in.html?rh=1#storylink=cpy

“Last year, for the 11th time in the 14 years that data has been collected, the disparity index that measures potential racial profiling by law enforcement in the state got worse. Black Missourians were 66 percent more likely in 2013 to be stopped by police, and blacks and Hispanics were both more likely to be searched, even though the likelihood of finding contraband was higher among whites.”

http://www.stltoday.com/news/opinion/columns/the-platform/editorial-michael-brown-and-disparity-of-due-process/article_40bb2d0e-8619-534a-b629-093ebc79f0a6.html

Of course, there are some counter-examples. Some news reports (what I’ve seen on Fox, for example) seem to ignore these data completely and instead appear to me to present the events in terms of violent bad black people who deserve whatever violent treatment the police provide for them. There is nothing useful to learn about data evaluation from these reports, so I will ignore them for the rest of this note.

I should state my bias: My personal (nonexpert) impression is that the shooting was unjustified and that the St. Louis County police response has been inappropriate. I have no insight into the motivation of anyone involved.

However, if you look at the actual numbers from Ferguson, it is not clear to me that conclusions of racial profiling like the ones quoted above, which have appeared in every news source that I respect, are justified by the data.

The focus of this blog is on the teaching of software engineering topics, primarily software testing and measurement (and thus too, statistical analysis).

The data from Ferguson provide an interesting example for caution in the interpretation of such data.

First, some of the numbers that are consistent with the summaries. According to the Attorney General’s report:

  • Ferguson’s population (age 16 and over) is 15,865, of whom 63% are black.
  • 4632 of 5384 vehicle stops (86%) were of blacks, a much higher percentage than the 63% of the population
  • 562 of the 611 searches (92%) were of blacks
  • 483 of the 521 arrests (93%) were of blacks
  • 12.13% of the blacks who were stopped were searched, compared to only 6.85% of the whites
  • 21.71% of the blacks who were searched had contraband (drugs, weapons, stolen property) compared to 34.04% of the whites.
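As a quick check, the headline percentages can be recomputed from the counts quoted above. This is a minimal sketch: the counts come from the post, and the over-representation ratio at the end is my own illustrative calculation, not necessarily the report’s official “disparity index.”

```python
# Recompute the cited percentages from the raw counts in the report.
stops_total, stops_black = 5384, 4632
searches_total, searches_black = 611, 562
arrests_total, arrests_black = 521, 483
black_population_share = 0.63  # of Ferguson residents age 16 and over

print(f"stops:    {stops_black / stops_total:.0%} black")        # 86%
print(f"searches: {searches_black / searches_total:.0%} black")  # 92%
print(f"arrests:  {arrests_black / arrests_total:.0%} black")    # 93%

# Share of stopped black drivers who were then searched.
print(f"search rate among stopped blacks: {searches_black / stops_black:.2%}")  # 12.13%

# A simple over-representation ratio: share of stops vs. share of population.
print(f"stop ratio: {stops_black / stops_total / black_population_share:.2f}")  # ~1.37
```

Note that the corresponding white search rate (6.85%) can’t be rederived this way, because the stop totals include drivers of other races and the report’s white stop count isn’t quoted in this post.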

These data appear to suggest two conclusions:

  1. Blacks are being stopped, searched and arrested at a higher rate than their representation in the population
  2. Many more searches of blacks than whites are unproductive, suggesting the police would find more contraband if they searched fewer blacks and more whites.

If Ferguson’s police are continuing to search blacks at a much higher frequency than whites, even though searched whites have contraband at a higher frequency than searched blacks, this appears to suggest a pattern that is racist and counterproductive (less protective of public safety).

That conclusion, I think, is the conclusion the newspapers are inviting us to draw.

Let’s look at some more data.

  • 66% (369/562) of the searches and 76% (369/483) of the arrests of blacks involved an outstanding warrant
  • 30% (14/47) of the searches and 39% (14/36) of the arrests of whites involved an outstanding warrant

Searches and arrests involving warrants don’t involve much exercise of judgment by the officer who is searching or arresting someone.

  • A warrant is an order from a court to arrest someone. The officer is supposed to stop and arrest a person if there is a warrant out for them.
  • When a police officer arrests someone, they must search the person. Among the many important reasons for this rule is the safety of the officer: arresting someone and then not checking them carefully for weapons would be extremely unwise.

In a community of only 15,865 people, it would not be surprising for the local police to be aware of most of the people who have warrants outstanding against them or for these police to recognize those people on the street.

Because the police are supposed to arrest people who have warrants against them and supposed to search people they arrest, I don’t think we should count these numbers of stops, searches and arrests against the police.

Now look at the proportion of searches and arrests that didn’t involve outstanding warrants:

  • 34% of the searches of black people, and 24% of the arrests of black people, did not involve an outstanding warrant.

In contrast

  • 70% of the searches of white people, and 61% of the arrests of white people, did not involve an outstanding warrant.
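These warrant-free percentages follow directly from the counts quoted earlier. A short sketch of the arithmetic (all counts taken from the post):

```python
# (total, warrant-related) counts from the report, as quoted above.
searches = {"black": (562, 369), "white": (47, 14)}
arrests = {"black": (483, 369), "white": (36, 14)}

for group in ("black", "white"):
    s_total, s_warrant = searches[group]
    a_total, a_warrant = arrests[group]
    print(f"{group}: {1 - s_warrant / s_total:.0%} of searches and "
          f"{1 - a_warrant / a_total:.0%} of arrests involved no warrant")
# black: 34% of searches and 24% of arrests involved no warrant
# white: 70% of searches and 61% of arrests involved no warrant
```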

The conclusions that these numbers suggest to me are that:

  • the Ferguson police appear to have been making discretionary stops (stops in which they were exercising their own judgment, rather than executing a court order) of white people at almost twice the rate for black people
  • the higher contraband-find rate for whites than blacks might be because a higher proportion of whites were searched on the basis of police suspicion of contraband, compared to a higher proportion of blacks being searched as part of an arrest that involved a warrant (past bad behavior, not currently suspicious behavior). Considered this way, the disparity (higher rate of contraband-finds for whites versus blacks) seems unsurprising and not at all suggestive of bad police work.

I don’t know what truth underlies these numbers. I think that, for me to interpret them with any confidence, I would have to do other studies, such as riding along with Ferguson police and learning how they decide who to stop and what post-stop behaviors trigger further investigation (such as searches or checks for outstanding warrants).

What does seem clear to me is that the first conclusion (racially-motivated differences in the police officers’ decisions to search people) is not supported by these data. That motivation might be present but—despite first appearances—these data do not seem to be evidence of it.

I wrote this note because it suggests two important lessons for students of statistics and research design:

  1. In many cases (as here), data may show statistically significant differences (differences too large and consistent to be plausibly due to chance). However, interpretation of those differences is almost always open to further investigation.
    • The numbers don’t tell you what they mean. Even the most statistically significant trends must be interpreted by people.
  2. In many cases (as here), the data support alternative interpretations.
    • Whenever possible, you should look at your data in many ways, to see if they tell you the same story. If they don’t, you need to investigate further, and maybe fix your model.
    • A few numbers in isolation tell you very little, often much less than you would initially imagine.
    • If you design your research (or your management) so that you will see only a few numbers at the end, you are designing tunnel vision into your work. You are creating your own context for bad interpretations and bad decisions.
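Lesson 2 is close to the classic confounding pattern sometimes called Simpson’s paradox. The sketch below uses invented numbers (not the Ferguson data) to show how an aggregate comparison can point one way while every subgroup points the other, once you split on a lurking variable such as warrant status:

```python
# Hypothetical search data: (contraband hits, searches) per stratum.
# Group A is searched mostly via warrants (a low-hit-rate stratum);
# Group B is searched mostly at officer discretion (a high-hit-rate stratum).
groups = {
    "A": {"discretionary": (9, 20), "warrant": (11, 80)},
    "B": {"discretionary": (35, 80), "warrant": (2, 20)},
}

for name, strata in groups.items():
    hits = sum(h for h, _ in strata.values())
    searches = sum(n for _, n in strata.values())
    by_stratum = {s: f"{h / n:.0%}" for s, (h, n) in strata.items()}
    print(f"{name}: overall {hits / searches:.0%}, by stratum {by_stratum}")
# A has the higher hit rate within each stratum, yet the lower overall rate,
# because most of A's searches fall in the low-yield warrant stratum.
```

A few numbers in isolation (here, the overall rates) would have suggested exactly the wrong conclusion.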

BBST Domain Testing Pilot: A few seats still available (June 22-July 19)

May 11th, 2014

We’re putting the finishing touches on BBST Domain Testing before the online pilot starts June 22. We have a handful of seats still available. (Sorry, the class is now full and we have had to close registration CK — 5/23/2014)

Apply to participate here

If you are selected, we will require a $50 non-refundable deposit to defray the cost of hosting the class. You will also need The Domain Testing Workbook (either print or electronic format available at contextdrivenpress.com or Amazon) to use in the class.

You can read Chris Kenst’s report on his experience in January’s face-to-face pilot at http://www.testingcircus.com/?wpdmact=process&did=Mi5ob3RsaW5r

As with all the BBST courses, this course comes with lots of homework. You should expect to work 10-15 hours per week on the course.

If you have questions, please contact us at info@bbst.info.

 

On the Quality of Qualitative Measures

April 28th, 2014

On the Quality of Qualitative Measures

Cem Kaner, J.D., Ph.D. & Rebecca L. Fiedler, M.B.A., Ph.D.

This is an informal first draft of an article that will summarize some of the common guidance on the quality of qualitative measures.

  • The immediate application of this article is to Kaner’s courses on software metrics and software requirements analysis. Students would be well-advised to read this summary of those lectures carefully (yes, this stuff is probably on the exam).
  • The broader application is to the increasingly large group of software development practitioners who are considering using qualitative measures as a replacement for many of the traditional software metrics. For example, we see a lot of attention to qualitative methods in the agenda of the 2014 Conference of the Association for Software Testing. We won’t be able to make it to CAST this year, but perhaps these notes will provide some additional considerations for their discussions.

On Measurement

Managers have common and legitimate informational needs that skilled measurement can help with. They need information in order to (for example…)

  • Compare staff
  • Compare project teams
  • Calculate actual costs
  • Compare costs across projects or teams
  • Estimate future costs
  • Assess and compare quality across projects and teams
  • Compare processes
  • Identify patterns across projects and trends over time

Executives need these types of information, whether we know how to provide them or not.

Unfortunately, there are strong reasons to be concerned about the use of traditional metrics to answer questions like these. These are human performance measures. As such, they must be used with care or they will cause dysfunction (Austin, 1996). That has been a serious real-life problem (e.g. Hoffman 2000). The empirical basis supporting several of them has been substantially exaggerated (Bossavit, 2014). Many of the managers who use them know so little about mathematics that they don’t understand what their measurements mean and their primary uses are to placate management or intimidate staff. Many of the consultants who give talks and courses advocating metrics also seem to know little about mathematics or about measurement theory. They seem unable to distinguish strong questions from weak ones, unable to discuss the underlying validity of the measures they advocate, and so they seem reliant on appeals to authority, on the intimidating quality of published equations, and on the dismissal of the critic as a nitpicker or an apologist for undisciplined practices.

In sum, there are problems with the application of traditional metrics in our field. It is no surprise that people are looking for alternatives.

In a history of psychological research methods, Kurt Danziger (1994) discusses the distorting impact of quantification on psychological measurement. (See especially his Chapter 9, From quantification to methodolatry.) Researchers designed experiments that looked more narrowly at human behavior, ignoring (designing out of the research) those aspects of behavior or experience that they could not readily quantify and interpret in terms of statistical models.

“All quantitative data is based upon qualitative judgments.”
(Trochim, 2006 at http://www.socialresearchmethods.net/kb/datatype.php)

Qualitative methods might sometimes provide a richer description of a project or product that is less misleading, easier to understand, and more effective as a source of insight. However, there are problems with the application of qualitative approaches.

  • Qualitative reports are, at their core, subjective.
  • They are subject to bias at every level (how the data are gathered or selected, stored, analyzed, interpreted and reported). This is a challenge for every qualitative researcher, but it is especially significant in the hands of an untrained researcher.
  • They are based on selected data.
  • They aren’t very helpful for making comparisons or for providing quantitative estimates (like, how much will this cost?).
  • They are every bit as open to abuse as quantitative methods.
  • And it costs a lot of effort to do qualitative measurement well.

We are fans of measurement (qualitative or quantitative) when it is done well and we are unenthusiastic about measurement (qualitative or quantitative) when it is done badly or sold overenthusiastically to people who aren’t likely to understand what they’re buying.

Because this paper won’t propagandize qualitative measurement as the unquestioned embodiment of sweetness and light, some readers might misunderstand where we are coming from. So here is a little about our background.

  • As an undergraduate, Kaner studied mainly mathematics and philosophy. He also took two semesters of coursework with Kurt Danziger. We only recently read Danziger (1994) and realized how profoundly Danziger has influenced Kaner’s professional development and perspective. As a doctoral student in experimental psychology, Kaner did some essentially-qualitative research (Kaner et al., 1978) but most of his work was intensely statistical, applying measurement theory to human perception and performance (e.g. Kaner, 1983). He applied qualitative methods to client problems as a consultant in the 1990’s. His main stream of published critiques of traditional quantitative approaches started in 1999 (Kaner, 1999a, 1999b). He wrote explicitly about the field’s need to use qualitative measures in 2002. He started giving talks titled “Software Testing as a Social Science” in 2004, explicitly identifying most software engineering measures as human performance measures subject to the same types of challenges as we see in applied measurement in psychology and in organizational management.
  • Fiedler’s (2006, 2007) dissertation used Cultural-Historical Activity Theory (CHAT)–a framework for analyzing and organizing qualitative investigations–to examine portfolio management software in universities. Kaner & Fiedler started applying CHAT to scenario test design in 2007. We presented a qualitative-methods tutorial and a long paper with detailed pointers to the literature at CAST in 2009 and at STPCon in Spring 2013. We continue to use and teach these ideas and have been working for years on a book relating qualitative methods to the design of scenario tests.

We aren’t new to qualitative methods. This is not a shiny new fad for us. We are enthusiastic about increasing the visibility and use of these methods but we are keenly aware of the risk of over-promoting a new concept to the mainstream in ways that dilute the hard parts until all that remains are buzzwords and rituals. (For us, the analogies are Total Quality Management, Six Sigma, and Agile Development.)

Perhaps some notes on what makes qualitative measures “good” (and what doesn’t) might help slow that tide.

No, This is Not Qualitative

Maybe you have heard a recommendation to make project status reporting more qualitative. To do this, you create a dashboard with labels and faces. The labels identify an area or issue of concern, such as how buggy the software is. Instead of numbers, you use colored faces, because faces are supposedly more meaningful. A red frowny-face says, “There is trouble here.” A yellow neutral-face says, “Things seem OK; nothing particularly good or bad to report right now.” A green smiley-face says, “Things are going well.” You could add more differentiation with a brighter red crying-face or screaming-or-cursing-face, and a brighter green happy-laughing face.

See, there are no numbers on this dashboard, so it is not quantitative, right?

Wrong.

The faces are ordered from bad to good. You can easily assign numerals to them (1 for red-screaming-face through 5 for green-laughing-face), talk about the “average” (median) score across all the categories of information, and even draw graphs of the change in confidence (or whatever you map to happyfacedness) from week to week across the project.
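The arithmetic is easy to make concrete. In this sketch (the face labels and the week-by-week data are invented for illustration), the faces form an ordinal scale, and standard quantitative summaries apply immediately:

```python
from statistics import median

# Map each face to its rank on the ordinal scale described above.
FACE_SCALE = {
    "red-screaming": 1,
    "red-frowny": 2,
    "yellow-neutral": 3,
    "green-smiley": 4,
    "green-laughing": 5,
}

# One dashboard category ("how buggy is it?") tracked week by week.
weekly_status = ["red-frowny", "yellow-neutral", "yellow-neutral",
                 "green-smiley", "red-screaming"]

scores = [FACE_SCALE[face] for face in weekly_status]
print("median status:", median(scores))  # the 'average' face, as a number
```

Anything you can rank, you can summarize numerically, which is exactly why the faces dashboard is quantitative measurement in disguise.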

This might not be very good quantitative measurement but as qualitative measurement it is even worse. It uses numbers (symbols that are equivalent to 1 to 5) to show status without emphasizing the rich detail that should be available to explain and interpret the situation.

When you watch a consultant present this as qualitative reporting, send him away. Tell him not to come back until he actually learns something about qualitative measures.

OK, So What is Qualitative?

A qualitative description of a product or process is a detail-rich, multidimensional story (or collection of stories) about it. (Creswell, 2012; Denzin & Lincoln, 2011; Patton, 2001).

For example, if you are describing the value of a product, you might present examples of cases in which it has been valuable to someone. The example wouldn’t simply say, “She found it valuable.” The example would include a description of what made it valuable, perhaps how the person used it, what she replaced with it, what made this one better than the last one, and what she actually accomplished with it. Other examples might cover different uses. Some examples might be of cases in which the product was not useful, with details about that. Taken together, the examples create an overall impression of a pattern – not just the bias of the data collector spinning the tale he or she wants to tell. For example, the pattern might be that most people who try to do THIS with the product are successful and happy with it, but most people who try to do THAT with it are not, and many people who try to use this tool after experience with this other one are likely to be confused in the following ways …

When you describe qualitatively, you are describing your perceptions, your conclusions, and your analysis. You back it up with examples that you choose, quotes that you choose, and data that you choose. Your work should be meticulously and systematically even-handed.  This work is very time-consuming.

Quantitative work is typically easier and less ambiguous, requires less-detailed knowledge of the product or project as a whole, and is therefore faster.

If you think your qualitative measurement methods are easier, faster and cheaper than the quantitative alternatives, you are probably not doing the qualitative work very well.

Quality of Qualitative

In quantitative measurement, questions about the value of a measure boil down to questions of validity and reliability.

A measurement is valid to the extent that it provides a trustworthy description of the attribute being measured. (Shadish, Cook & Campbell, 2001)

A measurement is reliable to the extent that repeating the same operations (measuring the same thing in the same ways) yields the same (or similar) results.

In qualitative work, the closest concept corresponding to validity is credibility (Guba & Lincoln, 1989). The essential question about the credibility of a report of yours is, Why should someone else trust your work? Here are examples of some of the types of considerations associated with credibility.

Examples of Credibility-Related Considerations

The first and most obvious consideration is whether you have the background (knowledge and skill) to be able to collect, interpret and explain this type of data.

Beyond that, several issues come up frequently in published discussions of credibility. (Our presentation is based primarily on Agostinho, 2005; Creswell, 2012; Erlandson et al., 1993; Finlay, 2006; and Guba & Lincoln, 1989.)

  • Did you collect the data in a reasonable way?
    • How much detail?: Students of ours work with qualitative document analysis tools, such as ATLAS.ti, Dedoose, and NVivo. These tools let you store large collections of documents (such as articles, slides, and interview transcripts), pictures, web pages, and videos (https://en.wikipedia.org/wiki/Computer_Assisted_Qualitative_Data_Analysis_Software). We are now teaching scenario testers to use the same types of tools. If you haven’t worked with one of these, imagine a concept-mapping tool that allows you to save all the relevant documents as sub-documents in the same document as the map and allows you to show the relationships among them not just with a two-dimensional concept map but with a multidimensional network, a set of linkages from any place in any document to any place in any other document.

    As you see relevant information in a source item, you can code it. Coding means applying meaningful tags to the item, so that you can see later what you were thinking now. For example, you might code parts of several documents as illustrating high or low productivity on a project. You can also add comments to these examples, explaining for later review what you think is noteworthy about them. You might also add a separate memo that describes your ideas about what factors are involved in productivity on this project, and another memo that discusses a different issue, such as notes on individual differences in productivity that seem to be confounding your evaluation of tool-caused differences. Later, you can review the materials by looking at all the notes you’ve made on productivity—all the annotated sources and all your comments.

    You have to find (or create) the source materials. For example, you might include all the specification-related documents associated with a product, all the test documentation, user manuals from each of your competitors, all of the bug reports on your product and whatever customer reports you can capture for other products, interviews with current users, including interviews with extremely satisfied users, users who abandoned the product and users who still work with the product but hate it. Toss in status reports, comments in the source code repository, emails, marketing blurbs, and screen shots. All these types of things are source materials for a qualitative project.

    You have to read and code the material. Often, you read and code with limited or unsophisticated understanding at first. Your analytical process (and your ongoing experience with the product) gives you more insight, which causes you to reread and recode material. The researcher typically works through this type of material in several passes, revising the coding structure and adding new types of comments (Patton, 2001). New information and insights can cause you to revise your analysis and change your conclusions.

    The final report gives a detailed summary of the results of this analysis.

    • Prolonged engagement: Did you spend enough time at the site of inquiry to learn the culture, to “overcome the effects of misinformation, distortion, or presented ‘fronts’, to establish rapport and build the trust necessary to overcome constructions, and to facilitate immersing oneself in and understanding the context’s culture”?
    • Persistent observation: Did you observe enough to focus on the key elements and to add depth? The distinction between prolonged engagement and persistent observation is the difference between having enough time to make the observations and using that time well.
    • Triangulation and convergence: “Triangulation leads to credibility by using different or multiple sources of data (time, space, person), methods (observations, interviews, videotapes, photographs, documents), investigators (single or multiple), or theory (single versus multiple perspectives of analysis).” (Erlandson et al. 1993, p. 137-138). “The degree of convergence attained through triangulation suggests a standard for evaluating naturalistic studies. In other words, the greater the convergence attained through the triangulation of multiple data sources, methods, investigators, or theories, the greater the confidence in the observed findings. The convergence attained in this manner, however, never results in data reduction but in an expansion of meaning through overlapping, compatible constructions emanating from different vantage points.” (Erlandson et al. 1993, p. 139).
  • Are you summarizing the data fairly?
  • How are you managing your biases (people are often not conscious of the effects of their biases) as you select and organize your observations?
  • Are you prone to wishful thinking or to trying to please (or displease) people in power?
    • Peer debriefing: Did you discuss your ideas with one or more disinterested peers who gave constructively critical feedback and questioned your ideas, methods, motivation, and conclusions?
    • Disconfirming case analysis: Did you look for counter-examples? Did you revise your working hypotheses in light of experiences that were inconsistent with them?
    • Progressive subjectivity: As you observed situations or created and looked for data to assess models, how much did you pay attention to your own expectations? How much did you consider the expectations and observations of others? An observer who affords too much privilege to his or her own ideas is not paying attention.
    • Member checks: If you observed / measured / evaluated others, how much did you involve them in the process? How much influence did they have over the structures you would use to interpret the data (what you saw or heard or read) that you got from them? Do they believe you accurately and honestly represented their views and their experiences? Did you ask?

Transferability

The concerns that underlie transferability are much the same as for external validity (or generalization validity) of traditional metrics:

  • If you or someone else did a comparable study in a different setting, how likely is it that you would make the same observations (see similar situations, tradeoffs, examples, etc.)?
  • How well would your conclusions apply in a different setting?

When evaluating a research report, thorough description is often a key element. The reader doesn’t know what will happen when someone tries a similar study in the future, so they (and you) probably cannot authoritatively predict generalizability (whether people in other settings will see the same things). However, if you describe what you saw well enough, in enough detail and with enough attention to the context, then when someone does perform a potentially-comparable study somewhere else, they will probably be able to recognize whether they are seeing things that are similar to what you were seeing.

Over time, a sense of how general something is can build as multiple similar observations are recorded in different settings.

Dependability

The concerns that underlie dependability are similar to those for internal validity of traditional metrics. The core question is whether your work is methodologically sound.

Qualitative work is more exploratory than quantitative (at least, more exploratory than quantitative work is traditionally described). You change what you do as you learn more or as you develop new questions. Therefore consistency of methodology is not an ultimate criterion in qualitative work, as it is for some quantitative work.

However, a reviewer can still ask how well (methodologically) you do your work. For example:

    • Do you have the necessary skills and are you applying them?
    • If you lack skills, are you getting help?
    • Do you keep track of what you’re doing and make your methodological changes deliberately and thoughtfully? Do you use a rigorous and systematic approach?

Many of the same ideas that we mentioned under credibility apply here too, such as prolonged engagement, persistent observation, effort to triangulate, disconfirming case analysis, peer debriefing, member checks and progressive subjectivity. These all describe how you do your work.

  • As issues of credibility, we are asking whether you and your work are worth paying attention to. Your attention to methodology and fairness reflects on your character and trustworthiness.
  • As issues of methodology, we are asking more about your skill than about your heart.

Confirmability

Confirmability is as close to reliability as qualitative methods get, but the qualitative approach does not rest as firmly on reliability. The quantitative measurement model is mechanistic. It assumes that under reasonably similar conditions, the same acts will yield the same results. Qualitative researchers are more willing to accept the idea that, given what they know (and don’t know) about the dynamics of what they are studying, under seemingly-similar circumstances, the same things might not happen next time.

We assess reliability by taking repeated measurements (do similar things and see what happens). We might assess confirmability as the ability to be confirmed rather than whether the observations were actually confirmed. From that perspective, if someone else works through your data:

  • Would they see the same things as you?
  • Would they generally agree that things you see as representative are representative and things that you see as idiosyncratic are idiosyncratic?
  • Would they be able to follow your analysis, find your records, understand your ways of classifying things and agree that you applied what you said you applied?
  • Does your report give your reader enough raw data for them to get a feeling for the confirmability of your work?

In Sum

Qualitative measurements tell a story (or a bunch of stories). The skilled qualitative researcher relies on transparency in methods and data to tell persuasive stories. Telling stories that can stand up to scrutiny over time takes enormous work. This work can have great value, but to do it, you have to find time, gain skill, and master some enabling technology. Shortchanging any of these areas can put your credibility at risk as decision-makers rely on your stories to make important decisions.

References

S. Agostinho (2005, March). “Naturalistic inquiry in e-learning research”, International Journal of Qualitative Methods, 4(1).

R. D. Austin (1996). Measuring and Managing Performance in Organizations. Dorset House.

L. Bossavit (2014). The Leprechauns of Software Engineering: How folklore turns into fact and what to do about it. Leanpub.

J. Creswell (2012, 3rd ed.). Qualitative Inquiry and Research Design: Choosing Among Five Approaches. Sage Publications.

K. Danziger (1994). Constructing the Subject: Historical Origins of Psychological Research. Cambridge University Press.

N.K. Denzin & Y.S. Lincoln (2011, 4th ed.) The SAGE Handbook of Qualitative Research. Sage Publications.

D.A. Erlandson, E.L. Harris, B.L. Skipper & S.D. Allen (1993). Doing Naturalistic Inquiry: A Guide to Methods. Sage Publications.

R. L. Fiedler (2006). “In transition”: An activity theoretical analysis examining electronic portfolio tools’ mediation of the preservice teacher’s authoring experience. Unpublished Ph.D. dissertation, University of Central Florida (Publication No. AAT 3212505).

R.L. Fiedler (2007). “Portfolio authorship as a networked activity”. Paper presented at the Society for Information Technology and Teacher Education.

R. L. Fiedler & C. Kaner (2009). “Putting the context in context-driven testing (an application of Cultural Historical Activity Theory)”. Conference of the Association for Software Testing. Colorado Springs, CO.

L. Finlay (2006). “‘Rigour’, ‘ethical integrity’ or ‘artistry’? Reflexively reviewing criteria for evaluating qualitative research.” British Journal of Occupational Therapy, 69(7), 319-326.

E.G. Guba & Y.S. Lincoln (1989). Fourth Generation Evaluation. Sage Publications.

D. Hoffman (2000). “The darker side of metrics,” presented at Pacific Northwest Software Quality Conference, Portland, OR.

C. Kaner (1983). Auditory and visual synchronization performance over long and short intervals, Doctoral Dissertation: McMaster University.

C. Kaner (1999a). “Don’t use bug counts to measure testers.” Software Testing & Quality Engineering, May/June, 1999, p. 80.

C. Kaner (1999b). “Yes, but what are we measuring?” (Invited address) Pacific Northwest Software Quality Conference, Portland, OR.

C. Kaner (2002). “Measuring the effectiveness of software testers.” 15th International Software Quality Conference (Quality Week), San Francisco, CA.

C. Kaner (2004). “Software testing as a social science.” IFIP Working Group 10.4 meeting on Software Dependability, Siena, Italy.

C. Kaner & R. L. Fiedler (2013). “Qualitative Methods for Test Design“. Software Test Professionals Conference (STPCon), San Diego, CA.

C. Kaner, B. Osborne, H. Anchel, M. Hammer & A.H. Black (1978). “How do fornix-fimbria lesions affect one-way active avoidance behavior?” 86th Annual Convention of the American Psychological Association, Toronto, Canada.

M.Q. Patton (2001, 3rd ed.). Qualitative Research & Evaluation Methods. Sage Publications.

W.R. Shadish, T.D. Cook & D.T. Campbell (2001, 2nd ed.). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Cengage.

W.M.K. Trochim (2006). Research Methods Knowledge Base.

Wikipedia (2014). https://en.wikipedia.org/wiki/Computer_Assisted_Qualitative_Data_Analysis_Software


Why propose an advanced certification in software testing?

March 26th, 2014

A couple of weeks ago, I posted A proposal for an advanced certification in software testing. There were plenty of comments, on the blog, on Twitter, and in private email to me.

I think the best way to respond to these is with a series of posts, each one focused on a different issue. This first one goes to the fundamental question: Why should we create such a thing?

I used to see certifications as irrelevant (and misleading)

For a long time, when people asked me whether they should get certified in software testing, I said no. I would say that, in my opinion, there is no value in the current certifications.

I know more good testers who are not certified than good ones who are. And I feel as though I’ve met a whole lot of clueless fools who carry testing certifications.

Many of the exam-review courses teach to the exam and present an oversimplified and outdated view of the field. I think that, from a what-will-you-learn perspective, taking them is a waste of time and money.

It used to seem obvious to me that certification must be irrelevant to a tester’s career.

The market proved me wrong

Unfortunately, my predictions that the community would see the ISTQB/ASQ/QAI-type credential as irrelevant were proved wrong.

The fact that hundreds of thousands of people in our field have decided to get certified demonstrates, in and of itself, that the credential is widely perceived as relevant.

I think that a willingness to discover and publish that you were mistaken is one of the critical traits of a scientist. The history of science is the story of a never-ending stream of ideas that were well supported at the time—but were proved wrong. They were replaced with better ideas that were more useful and better-supported—and proved wrong too.

It seems to me that I can’t be a great tester (or an adequate scientist) if I am reluctant to subject my beliefs and ideas to the same level of criticism that I apply to the work of others.

In retrospect, I realize that I misread the evolution of certification in 1990 through 2010.

  • The testing community had a growing core of people who had decided to do this work as a career. They believed they were committed to doing good work and that they were good at what they did.
  • The demand for testing services was exploding, with floods of new people who had little background, varying levels of commitment and increasingly inflated salary expectations.
  • Many of the people who saw themselves as professionals were getting tired of being characterized as unskilled, clueless bureaucrats by so many other people in the development community.
  • Many of the people involved in recruiting testers or setting their pay scales don’t know enough about testing to tell the good ones from incompetents who can spin persuasive resumes and interviews.
  • In this environment, even if you are a test manager with really good hiring instincts, you still have the challenge of justifying the salaries you want to pay to people who don’t understand your staff.

Certification was sold as a formal credential: something that demonstrates (at a minimum) that you are committed enough to the field to go through the hassle of getting certified, and that you are at least familiar with the basics of the field and good enough at precision reading to pass a formal exam.

If there is no stronger credential in the field, it is easy to see this as better than nothing.

I think that some of the get-certified sales pitches go far beyond this, saying or implying that certification demonstrates that a person has genuine professional competence. I think that goes far beyond what any of these certifications could possibly attest to, but I think that’s the impression that is sometimes encouraged.

We can argue about the motivation and about the marketing. We can speculate endlessly about why someone would spend good money on exam-prep courses so they could get one or more of these certifications.

I think it is more useful to ask whether we can give them better value for their time and money.

One approach: Open Certification

My interest in creating a better alternative to the current certifications is not new. Back in 2006, Mike Kelly and I started hosting workshops to plan an “Open Certification”. The idea was to create a huge, open pool of multiple-choice questions and to examine candidates via a random stratified sample of questions from the pool. However, there were some insurmountable problems:

  • We were determined to not be tied to one proprietary body of knowledge. But consider this example: Suppose we are willing to accept six different widely-used definitions of “test case.” Which one is the right one for this exam? And what if the student encounters (and answers on the basis of) Definition 7? How do we say that one is wrong?
    • The obvious way to deal with this is to write the question to say “Famous Person 1’s definition of test case is …” but what do people have to do to prepare for such an exam? Do they have to memorize 6 different definitions and the names of the people we tie those definitions to? Almost no one could pass such an exam. And even if you could pass it, all the memorizing you would have to do in order to pass it would be an abuse of your time.
  • We were determined, back then, to do something extremely cheap or free. But the development and maintenance costs for the software and questions were going to be very high. Even if we could get volunteer labor to create the first drafts of the exam (and exam site), we would need to do a lot of sustaining engineering. People were going to have to be paid.
  • The exam would be free but with this complex a series of questions, how long would it be before training companies started selling exam prep courses? The cost of the exams is not the big cost factor in the other certifications. It is the cost of the training. Were we kidding ourselves about making a difference here?
  • Finally, there was the most difficult problem. Even if the exam was successful, it would still be a bunch of multiple-choice questions. Our approach to certification wouldn’t be offering any better evidence of deep knowledge or skill than the others.
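The sampling mechanism described above (drawing each candidate's exam as a random stratified sample from an open question pool) can be sketched in a few lines. This is an illustration only; the topic names, pool sizes, and per-topic counts are hypothetical, not anything the Open Certification project actually specified:

```python
import random

# Hypothetical question pool, grouped into strata (topics).
# Topic names and pool sizes are illustrative placeholders.
pool = {
    "design":  [f"design-q{i}" for i in range(1, 51)],
    "bug":     [f"bug-q{i}" for i in range(1, 41)],
    "measure": [f"measure-q{i}" for i in range(1, 31)],
}

def draw_exam(pool, per_topic, seed=None):
    """Draw a stratified random sample: per_topic[t] questions from each stratum t."""
    rng = random.Random(seed)
    exam = []
    for topic, questions in pool.items():
        exam.extend(rng.sample(questions, per_topic[topic]))
    return exam

# Each candidate gets a different random exam with the same topic proportions.
exam = draw_exam(pool, {"design": 5, "bug": 4, "measure": 3}, seed=42)
print(len(exam))  # 12 questions: 5 design, 4 bug advocacy, 3 measurement
```

Stratifying by topic keeps every generated exam comparable in coverage even though no two candidates see the same questions, which was the point of the open-pool design.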

I forget his exact words, but Mike laid out an important criterion early in the project. If we couldn’t be confident of developing something clearly better than the alternative we were replacing, we shouldn’t bother doing it. As we proceeded, it became clearer and clearer that we were creating something that might be cheaper, but that probably wasn’t better.

Eventually, we pulled the plug on Open Certification.

But that was not abandonment of the idea of a better certification. It was a recognition that we didn’t have a better idea, yet.

In parallel with the Open Certification project, I was transforming BBST from a purely academic course to a very student-challenging industrial course.

One of the really valuable outcomes of the Open Certification meetings was a “standard” for drafting challenging multiple-choice test questions. I applied this to the BBST courses, creating a suite of quiz questions that BBST’s graduates have come to know and love.

But we didn’t stop with multiple-choice. We used multiple-choice as a tutorial tool, not as the core examiner. BBST demanded a much higher level of knowledge and skill than I knew how to get from multiple-choice exams. I concluded that something along these lines was a better way to go.

Another alternative

Rather than trying to replace the ASQ/ISTQB/QAI approach, I think we can build on it.

  • Let people get one of those credentials. Or let them get some other credential that is challenging but that approaches the field in less simplistic terms. Treat their credential-from-training as a baseline.
  • From here, let the tester present a portfolio of evidence that s/he can do more than just pass an exam or two—that s/he can actually do competent work in the field.

The person who can demonstrate both, mastery of basic training and a competent portfolio, gets an advanced certification.

I think this gives us two important advances:

  • It breaks out of the ideological stranglehold that a few vendors have had on credentialing in our field.
  • It presents a richer view of the capabilities and contributions of the person who carries the credential.

This isn’t perfect, but it’s better. I think that has some value.


A proposal for an advanced certification in software testing

March 3rd, 2014

This is a draft of a proposal to create a more advanced, more credible credential (certification) in software testing.

The core idea is a certification based on a multidimensional collection of evidence of education, experience, skill and good character.

  • I think it is important to develop a credential that is useful and informative.
    • I think we damage the reputation of the field if we create a certification that requires only a shallow knowledge of software testing.
    • I think we damage the value of the certification if we exaggerate how much knowledge or skill is required to obtain it.
  • I think it is important to find a way to tolerate different approaches to software testing, and different approaches to training software testers. This proposal is not based on any one favored “body of knowledge” and it is not tied to any one ideology or group of vendors.

The idea presented here is imperfect—as are the other certifications in our field. It can be gamed—as can the others. Someone who is intent on gaining a credential via cheating and fraud can probably get away with it for a while—but the others have security risks too. This certification does not assure that the certified person is competent—neither do the others. The certification does not subject the certified person to formal professional accountability for their work—neither do the others—and even though certificate holders say that they will follow a code of ethics, we have no mechanism for assuring that they do or punishing them if they don’t—and neither do the others.

With all these we-don’t-do-thises and we-don’t-promise-thats, you might think I’m kidding about this being a real proposal. I’m not.

Even if we agree that this proposed certification lacks the kinds of powers that could be bestowed by law or magic, I think it can provide useful information and that it can create incentives that favor higher ethics in job-seeking and, eventually, professional practice. It is not perfect, but I think it is far better than what we have now.

The Proposal

This credential is based on a collection of several different types of evidence that, taken together, indicate that the certificate holder has the knowledge and skill needed to competently perform the usual services provided by a software tester.

Here are the types of evidence. As you read this, imagine that the Certification Body hosts a website that will permanently post a publicly-viewable dossier (a collection of files) for every person certified by that body. The dossier would include everything submitted by an applicant for certification, plus some additional material. Here’s what we’d find in the file.

Authorization by the Applicant

As part of the application, the applicant for Certification would grant the Certification Board permission to publish all of the following materials. The applicant would also sign a legal waiver that would shield the Board from all types of legal action by the applicant / Certified Tester arising out of publication of the materials described below. The waiver will also authorize the Board to exercise its judgment in assessing the application and will shield the Board from legal action by the applicant if the Board decides, in its unfettered discretion, to reject the applicant’s application or to later cancel the applicant’s Certification.

Education (Academic)

The Certified Tester should have at least a minimum level of formal education. The baseline that I imagine is a bachelor’s-level degree in a field relevant to software testing.

  • Some fields, such as software engineering, are obviously relevant to software testing. But what about others like accounting, mathematics, philosophy, physics, psychology, or technical writing? We would resolve this by requiring the applicant for certification to explain in writing how and why her or his education has proved to be relevant to her or his experiences as a tester and why it should be seen as relevant education for someone in the field.
  • The requirement for formal education should be waived if the applicant requests waiver and justifies the request on the basis of a sufficient mix of practical education and professional achievement.

Education (Practical)

The Certified Tester should have successfully completed a significant amount of practical training in software testing. Most of this training would typically be course-based, typically commercial training. Some academic courses in software testing would also qualify. A non-negotiable requirement is successful completion of at least some courses that are considered advanced. “Successful” completion means that the student completed an exam or capstone project that a student who doesn’t know the material would not pass.

  • There is an obvious accreditation issue here. Someone has to decide which courses are suitable and which are advanced.
  • I think that many different types of courses and different topics might be suitable as part of the practical training. For example, suppose we required 100 classroom-hours of training (1 training day = 6 classroom hours). Perhaps 60 of those hours could be in related fields (programming, software metrics, software-related law, project accounting, etc.) but a core would have to be explicitly focused on testing.
  • I think the advanced course hours (24 classroom hours?) would have to be explicitly advanced software testing courses.
  • There is no requirement that these courses come from any particular vendor or that they follow any particular software testing or software development ideology.
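As a sketch of how such an accounting might work, here is a minimal check of a training portfolio against the hour thresholds floated above (100 total classroom hours, at most 60 in related fields, at least 24 in explicitly advanced testing courses). The course categories and the sample portfolio are hypothetical:

```python
def meets_training_requirement(courses,
                               total_required=100,
                               max_related=60,
                               advanced_required=24):
    """Check a portfolio against the illustrative hour thresholds.

    Each course is a (hours, kind) pair, with kind one of
    "testing", "advanced" (advanced testing), or "related" (related field).
    """
    total = sum(hours for hours, _ in courses)
    related = sum(hours for hours, kind in courses if kind == "related")
    advanced = sum(hours for hours, kind in courses if kind == "advanced")
    return (total >= total_required
            and related <= max_related
            and advanced >= advanced_required)

# Hypothetical portfolio: 36h core testing, 24h advanced testing, 42h related.
portfolio = [(36, "testing"), (24, "advanced"), (42, "related")]
print(meets_training_requirement(portfolio))  # True: 102h total, 42h related, 24h advanced
```

The point of writing the rule down this way is that the thresholds are parameters; an accrediting body could tune them without changing the structure of the requirement.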

Examination

The Certified Tester should have successfully completed a proctored, advanced, examination in software testing.

  • This requirement anticipates competing exams offered by several different groups that endorse different approaches to software testing. Our field does not have agreement on one approach or even one vocabulary. The appearance of agreement that shows up in industry “standards” is illusory. As a matter of practice (I think, often good practice), the standards are routinely ignored by practitioners. Examinations that adopt or endorse these standards should be welcome but not mandatory.

Which exams are suitable and which are advanced?

There is an obvious accreditation issue here. Someone has to decide which exams are suitable and which are advanced.

I am inclined to tentatively define an advanced exam as one that requires as minimum prerequisites (a) successful completion of a specified prior exam and (b) additional education and experience. For example, ISTQB Foundations would not qualify but an ISTQB Advanced or Expert exam might. Similarly, BBST:Foundations would not qualify but BBST:Bug Advocacy might and BBST:Domain Testing definitely should.

An exam might be separate from a course or it might be a final exam in a sufficiently advanced course.

For an exam to be used by a Certified Tester, the organization that offers and grades the exam must provide the Certification Board with a copy of a sample exam. The organization must attest under penalty of perjury that they believe the sample is fairly representative of the scope and difficulty of the actual current exam. This sample will appear on the Certification Board’s website, and be accessible as a link from the Certified Tester’s dossier. (Thus, the dossier doesn’t show the Certified Tester’s actual exam but it does show an exam that is comparable to the actual one.)

What about the reliability and the validity of the exams?

Let me illustrate the problem with two contrasting examples:

  • I think it is fair to characterize ISTQB as an organization that is striving to create highly reliable exams. To achieve this, they are driven toward questions that have unambiguously correct answers. Even in sample essay questions I have seen for the Expert exam, the questions and the expected answers are well-grounded in a published, relatively short, body of knowledge. I think this is a reasonable and respectable approach to assessment and I think that exams written this way should be considered acceptable for this certification.
  • The BBST assessment philosophy emphasizes several other principles over reliability. We expect answers to be clearly written, tightly focused on the question that was asked, with a strong logical argument in favor of whatever position the examinee takes in her or his answer, that demonstrates relevant knowledge of the field. We expect a diversity of points of view. I think it gives the examiner greater insight into the creativity and depth of knowledge of the examinee. I think this is also a reasonable and respectable approach to assessment that we should also consider acceptable for this certification.

There is a tradeoff between these approaches. Approaches like ISTQB’s are focused on the reliability of the exam, especially on between-grader reliability. This is an important goal. The BBST exams are not focused on this. For certification purposes, we would expect to improve BBST reliability by using paired grading (two examiners) but this is imperfect. I would not expect the same level of reliability in BBST exams that ISTQB achieves. However, in my view of the assessment of cognitively complex skills, I believe the BBST approach achieves greater validity. Complicating the issue, there are problems in the measurement of both reliability and validity of education-related exams.

The difference here is not just a difference of examination style. I believe it reflects a difference in ideology.

Somehow, the Certification Board will have to find a way to accredit some exams as “sufficiently serious” tests of knowledge even though one is obviously more reliable than the other, one is obviously more tightly based on a published body of knowledge than the other, etc.

Somehow, the Certification Board will have to find a way to refuse to accredit some exams even though they have the superficial form of an exam. In general, I suspect that the Certification Board will cast a relatively broad net and that if groups like ASQ and QAI offer advanced exams, those exams will probably qualify. Similarly, I suspect that a final exam in a graduate-level university course that is an “advanced” software testing course (prerequisite being successful completion of an earlier graduate-level course in testing) would qualify.

Professional Achievement

Professional achievements include publications, honors (such as awards), and other things that indicate that the candidate did something at a professional level.

An applicant for certification does not have to include any professional achievements. However, if the applicant provides them, they will become part of the applicant’s dossier and will be publicly visible.

Some decisions will lie in the discretion of the Certification Board. For example, the Certification Board:

  • might or might not accept an applicant’s academic background as sufficiently relevant (or as sufficiently complete)
  • might or might not accept an applicant’s training-experience portfolio as sufficient or as containing enough courses that are sufficiently related to software testing

In such cases, the Certification Board will consider the applicant’s professional achievements as additional evidence of the applicant’s knowledge of the field.

References

The applicant will provide at least three letters of endorsement from other people who have stature in the field. These letters will be public, part of the Certified Tester’s dossier. An endorsement is a statement from a person that, in that person’s opinion, the applicant has the knowledge, skills and character needed to competently provide the services of a professional software tester. The letter should provide additional details that establish that the endorser knows the knowledge, skill and character of the applicant well enough to credibly make an endorsement.

  • A person of stature is someone who is experienced in the field and respected. For example, the person might be (this is not a complete list)
    • personally known to the Certification Board
    • a Certified Tester
    • a Senior Member or Distinguished Member or Fellow of ACM, ASQ, or IEEE
  • If one of the endorsers withdraws his or her endorsement, that withdrawal will be published in the Certified Tester’s dossier along with the original endorsement (now marked “withdrawn”) and the Certified Tester will be required to get a new endorser.
  • If one of the apparent endorsers contacts the Certification Board and asserts that s/he did not write an endorsement for an applicant and that s/he does not endorse the applicant, and if the apparent endorser provides credible proof of identity, that letter will be published in the Certified Tester’s dossier along with the original letter (now marked “disputed”).

Professional Experience

The applicant will provide a detailed description of his or her professional history that includes at least N years of relevant experience.

  • The applicant must attest that this description is true and not materially incomplete. It will be published as part of the dossier. Potential future employers will be able to check the claims made here against the claims made in the applicant’s application for work with them.
  • The descriptions of relevant positions will include descriptions of the applicant’s role(s) and responsibilities, including typical tasks s/he performed in that position
  • The applicant’s years of relevant experience and years of formal education will interact: Someone with more formal education that is relevant to the field will be able to become certified with less relevant experience (but never less than K years of experience).
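The interaction in the last bullet can be written as a simple rule: the required experience falls as education-based credit rises, but never below the floor K. The specific values here (N = 5, K = 2, and the amount of credit per degree) are hypothetical placeholders, since the proposal deliberately leaves N and K open:

```python
def required_experience_years(education_credit_years, N=5, K=2):
    """Years of relevant experience required, given credit for formal education.

    N and K are the proposal's open parameters: N is the baseline experience
    requirement and K is the floor below which the requirement never drops.
    The defaults are illustrative placeholders, not proposed values.
    """
    return max(K, N - education_credit_years)

# More relevant formal education lowers the requirement, but never below K.
print(required_experience_years(0))   # 5 years with no education credit
print(required_experience_years(2))   # 3 years
print(required_experience_years(10))  # 2 years: the floor K holds
```

Whatever N and K turn out to be, the `max` keeps the rule honest: no amount of formal education substitutes entirely for time in the field.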

Continuing Education

The candidate must engage in professional activities, including ongoing study, to keep the certification.

Code of Ethics

The candidate must agree to abide by a specific Code of Ethics, such as the ACM code. We should foresee this as a prelude to creating an enforcement structure in which a Certified Tester might be censured or certification might be publicly canceled for unethical conduct.

Administrative Issues

Somehow, we have to form a Certification Board. The Board will have to charge a fee for application because the website, the accrediting activities, evaluation of applications, marketing of the certification, etc., will cost money.

Benefits

This collection of material does not guarantee competence, but it does present a multidimensional view of the capability of an experienced person in the field. It speaks to a level of education and professional involvement and to the credibility of self-assertions made when someone applies for a job, submits a paper for publication, etc. I think that the public association of the endorser with the people s/he endorses will encourage most possible endorsers to think carefully about who they want to be permanently publicly identified with. I think the existence of the dossier will discourage exaggeration and fraud by the Certified Tester.

It is not perfect, but I think it will be useful, and better than what I think we have now.

This is not a certification of a baseline of competence in the way that certifications (licenses) work in fields like law, engineering, plumbing, and cosmetology. Those are regulated professions in which the certified person is subject to penalties and civil litigation for conduct that falls below baseline. Software engineering (including software testing) is not a regulated profession, there is no such cause of action in the courts as “software engineering malpractice,” and there are no established penalties for incompetence. There is broad disagreement in the field about whether such regulations should exist (for example, the Association for Computing Machinery strongly opposes the licensing of software engineers while the IEEE seems inclined to support it) and the creation of this certification does not address the desirability of such regulation.

The Current Goal: A Constructive Discussion

This article is a call for discussion. It is not yet a call for action, though I expect we’ll get there soon.

This article follows up an article I wrote last May about credentialing systems. I identified several types of credentials in use in our field and suggested four criteria for a better credential:

  • reasonably attainable (people could afford to get the credential, and reasonably smart people who worked hard could earn it),
  • credible (intellectually and professionally supported by senior people in the field who have earned good reputations),
  • scalable (it is feasible to build an infrastructure to provide the relevant training and assessment to many people), and
  • commercially viable (sufficient income to support instructors, maintainers of the courseware and associated documentation, assessors (such as graders of the students and evaluators of the courses), some level of marketing (because a credential that no one knows about isn’t worth much), and in the case of this group, money left over for profit. Note that many dimensions of “commercial viability” come into play even if there is absolutely no profit motive: the effort has to support itself, somehow).

I think the proposal in this article sketches a system that would meet those criteria.

A more detailed draft of this proposal was reviewed at the 2014 Workshop on Teaching Software Testing. We did not debate alternative proposals or attempt to reach consensus. The ideas in this paper are not the product of WTST. Nor are they the responsibility of any participant at WTST. However, I am here acknowledging the feedback I got at that meeting and thanking the participants: Scott Allman, Janaka Balasooriya, Rex Black, Jennifer Brock, Reetika Datta, Casey Doran, Rebecca L. Fiedler, Scott Fuller, Keith Gallagher, Dan Gold, Douglas Hoffman, Nawwar Kabbani, Chris Kenst, Michael Larsen, Jacek Okrojek, Carol Oliver, Rob Sabourin, Mike Sowers, and Andy Tinkham. Payson Hall has also questioned the reasoning and offered useful suggestions.

To this point, we have been discussing whether these ideas are worthwhile in principle. That’s important and that discussion should continue.

We have not yet begun to tackle the governance and implementation issues raised by this proposal. It is probably time to start thinking about that.

  • I’m positively impressed by (what I know of) the governance model of ISTQB and wonder whether we should follow that model.
  • I would expect to be an active supporter/contributor to the governance of this project (for example an active member of the governing Board). However—just as I helped found AST but steadfastly refused to run for President of AST—I believe we can find a better choice than me for chief executive of the project.

Comments?

New Book: Foundations of Software Testing–A BBST Workbook

February 14th, 2014

New Book: Foundations of Software Testing—A BBST Workbook

Rebecca Fiedler and I just published our first book together, Foundations of Software Testing—A BBST Workbook.

Becky and I started working on the instructional design for the online version of the BBST (Black Box Software Testing) course in 2004. Since then, Foundations has gone through three major revisions. Bug Advocacy and Test Design have gone through two.

Our Workbooks mark our first major step toward the next generation of BBST™.

We are creating the new versions of BBST through Kaner, Fiedler & Associates, a training company that we formed to provide an income stream for ongoing evolution of these courses. BBST is a registered trademark of Kaner, Fiedler & Associates.

What’s in the Book

The Workbook includes slides, lecture transcripts, orientation activities and feedback, application activities, exam advice, and author reflections. Here are some details:

All the course slides

Foundations has 304 slides. Some of these are out of date. We provide notes on these in the Author’s Reflections.

A transcript of the six lectures.

The transcripts are almost word-for-word the same as the spoken lecture. They actually reproduce the script that I wrote for the lecture. In a few cases, my scripts are a little longer than what actually made it past the video edits. We lay the transcript and the slides out together, side-by-side. In an 8.5×11 printed book, this is a great format for taking notes. Unfortunately, it doesn’t translate to Kindle well, so there is no Kindle edition of the book.

Four Orientation Activities

Orientation activities introduce students to a key challenge considered in a lecture. The student puzzles through the activity for 30 to 90 minutes, typically before watching the lecture, then sees how the lecture approaches this type of problem. The typical Foundations course has two to four of these.

The workbook presents the instructions for four activities, along with detailed feedback on them, based on past performance of students in the online and university courses.

I revised, rewrote, or created anew all of these activities for this Workbook. Because, in my opinion, the most important learning in BBST comes from what the students actually do in the class, the new Orientation and Application activities create a substantial revision to the course.

In my university courses, I practice continuous quality improvement, revising all of them every term in response to (a) my sense (and whatever relevant data I have collected) of strengths and weaknesses that showed up in previous offerings of the course or (b) ideas that have demonstrated their value in other courses and can be imported into this one. Most of the updates are grounded in a long series of revisions that I used and evaluated in my university-course version of BBST.

Two Application Activities

An application activity applies ideas or techniques presented in a lecture or developed over several lectures. The typical application activity calls for two to six hours of work per student. The typical Foundations course has one to two of these.

One of these is revised from the public BBST; the other is completely rewritten.

Advice on answering our essay-style exam questions

The advice runs 11 pages. I also provide a practice question and detailed feedback on the structure of the answer.

I think the advice is good for anyone taking the course, but it is particularly focused on university students who are preparing for an exam that will yield graded results (A, B, Pass-with-distinction, etc.). The commercial versions of BBST are typically pass-fail, so some of the fine details in this advice are beyond the needs of those readers. If you are a university student, I recommend this as a tighter and more polished presentation than the exam-preparation essay included in the public course.

Author reflections

My reflections present my sense of the strengths and weaknesses of the current course, the ways we are addressing those with the new activities, and some of the changes we see coming in the next generation of videos.

Because Foundations is written to introduce students to the fundamental challenges in software testing, some of my reflections add commentary on widely-debated issues in the field. Some of these might become focus points for the usual crowd to practice their Sturm und Drang on Twitter.

Who the Book is For

We want to support three groups with the BBST Workbooks:

  • Self-studiers. Many people watch the course videos on their own. The course videos for the current version of Foundations are available for free, along with the slides, the course readings, and the public-course versions of four activities and the study guide (list of essay questions for the exam). The Workbook updates the activities, provides detailed feedback for all of the orientation activities, and adds several design notes on the orientation and application activities. If you are studying BBST on your own or with a few friends, we believe this provides much better support than the videos alone.
  • In-house trainers. If you are planning to teach BBST to staff at your company, the Workbook is an inexpensive textbook to support the students. The feedback for the activities provides a detailed survey of common issues and ideas raised in each activity. If your trainees submit their work to you for review, you might want to supplement these notes with comments that are specific to each student’s work. The comments in the workbook should cover most of the comments that you would otherwise repeat from student to student. The instructors’ reflections will, we hope, give you ideas about how to tailor the application activities (or replace them) to make them suitable for your company.
  • Students in instructor-led courses. The BBST Foundations in Software Testing Workbook is an affordably-priced (retail price $19.99) supplement to any instructor-led course. Students will appreciate the convenience of print versions of the course slides and lectures for ongoing reference. Instructors will appreciate the level of feedback provided to students in the workbook.

Buy the Book
Foundations of Software Testing: A BBST Workbook is available from:

Evolution of the BBST Courses

February 13th, 2014

With our first teaching of the new BBST:Domain Testing course (based on The Domain Testing Workbook) and our revision of BBST:Foundations with the Foundations of Software Testing workbook, Rebecca Fiedler and I have started to introduce the next generation of BBST. Recently, we’ve been getting requests for papers or interviews on where BBST came from and where it’s going.

  • This note is a short summary of the history of BBST. You can find many more details in the articles I’ve been posting to this blog over the last decade, as I tried to think through the course design’s strengths, weaknesses and potential.
  • My next post, and an upcoming article in Testing Circus, look at the road ahead.

What is BBST™?

BBST is a series of courses on Black Box Software Testing. The overall goal of the series is to improve the state of the practice in software testing by helping testers develop useful testing skills and deeper insights into the challenges of the field.

(Note: BBST is a registered trademark of Kaner, Fiedler & Associates.)

Today’s BBST Courses

Today, most people familiar with the BBST courses think of a four-week, fully online course. Rebecca Fiedler and I started working on the instructional design for the BBST online courses back in 2004, with funding from the National Science Foundation. The courses have gotten excellent reviews. We’ve taken Foundations through three major revisions. Bug Advocacy and Test Design have had two. We’re working on our next major update now. You’ll read more about that in my next post.

The typical instructor-led course is organized around six lectures (about six hours of talk, divided into one-hour parts), with a collection of activities. To successfully complete a typical instructor-led course, a student spends about 12-15 hours per week for 4 weeks (48-60 hours total). Most of the course time is spent on the activities:

  • Orientation activities introduce students to a key challenge considered in a lecture. The student puzzles through the activity for 30 to 90 minutes, typically before watching the lecture, then sees how the lecture approaches this type of problem. The typical Foundations course has two to four of these.
  • Application activities call for two to six hours of work per student. They apply ideas or techniques presented in a lecture or developed over several lectures. The typical Foundations course has one to two of these.
  • Multiple-choice quizzes help students identify gaps in their knowledge or understanding of key concepts in the course. These questions are tough because they are designed to be instructional or diagnostic (to teach you something, to deepen your knowledge of something, or to help you recognize that you don’t understand something) rather than to fairly grade you.
  • Various other discussions that help the students get to know each other better, chew on the course’s multiple-choice quiz questions, or consider other topics of current interest.
  • An essay-style final exam.

In the instructor-led course, students get feedback on the quality of their work from each other and, to a lesser or greater degree (depending on who’s teaching the course), from the instructors. Students in our commercial courses (which we offer through Kaner, Fiedler & Associates) get a lot of feedback. Students in courses taught by unpaid volunteer instructors are more likely to get most of their feedback from the other students.

So, that’s today (in the online course).

However, BBST has actually been around for 20 years.

Background on the BBST Course Design

I started teaching BBST in 1994, with Hung Quoc Nguyen, for the American Society for Quality in Silicon Valley. This was the commercial version of the course (taught to people working as testers). Development of the course was significantly influenced by:

  • Detailed peer reviews of the live class and of the circulating slide decks. The reviews included detailed critiques from colleagues when I made significant course updates (I offered free beta-review classes to test the updates).
  • Co-teaching the material with colleagues. We would learn together by cross-teaching material, often challenging points in each other’s slides or lecture in front of the students. For example, I taught with James Bach, Elisabeth Hendrickson, Doug Hoffman and (for the metrics material) Pat Bond, a professor at Florida Tech.
  • Rational Software, which contracted with me to create a customized version of BBST to support testing under the Rational Unified Process. They criticized the course in detail over several pilot teachings, and allowed me to apply what I learned back to the original course.

In 1999, I decided that if I wanted to learn how to significantly improve the instructional value of the course, I was going to have to see how teachers help students learn complex topics and skills in university. My sense was, and is, that good university instruction goes much deeper and demands more from the students than most commercial training.

Florida Tech hired me in 2000 to teach software engineering and encouraged me to evolve BBST down two parallel tracks:

  • a core university course that would be challenging for our graduate students and a good resume-builder when our students looked for jobs
  • a stronger commercial course that demanded more from the students.

We correctly expected that the two tracks would continually inform each other. Getting feedback from practitioners would help us keep the academic stuff real and useful. Trying out instructional ideas in the classroom would give us ideas for redesigning the learning experience of commercial students.

By 2003, I realized that most of my students were doing most of their learning outside the classroom. They claimed to like my lectures, but they were learning from assignments and discussions that happened out of the classroom. In 2004, I decided to try taping the lectures. The students could watch these at home, while we did activities in the classroom that had previously been done out of class. This went well, and in 2005, I created a full set of course videos.

I used the 2005 videos in my own classes. I put a Creative Commons license on the videos and posted them, along with other supporting materials, on my lab’s website. Rebecca Fiedler and I also started giving talks to educators about our results, such as these two papers (Association for Educational & Communications Technology conference and the Sloan Conference on Asynchronous Learning Networks in 2005).

These days, what we were doing has a name (“flipping”) and the Open Courseware concept is old news. Back then, it was still hard to find examples of other people doing this. Even though many other people were experimenting with the same ideas, not many people were yet publishing and so we had to puzzle through the instructional ideas by reading way too much stuff and thinking way too hard about way too many conflicting opinions and results. We summarized our own design ideas in the 2005 presentations (cited above). A good sample of the literature we were reading appeared in our applications for funding to the National Science Foundation, such as the one that was funded (2007), which gave us the money to pay graduate students to help with evaluation and redesign of the course (yielding the current public version).

For readers interested in the “science” that informed our course design, I’m including an excerpt from the 2007 Grant Proposal at the end of this article.

The Collaboration with AST

While we were sketching the first BBST videos, we were also working to form AST (the Association for Software Testing). AST incorporated in 2004. Perhaps a year later, Rebecca and I decided that the academic version of online BBST could probably be adapted for working testers. The AST activists at that time were among my closest professional friends, so it was natural to bring this idea to them.

We began informally, of course. We started by posting a set of videos on a website, but people kept asking for instructor support—for a “real” class. By this point (late 2006), the Florida Tech course was maturing and I was confident (in retrospect, laughably overconfident) that I could translate what was working in a mixed online-plus-face-to-face university class to a fully online course for practitioners located all over the world. The result worked so badly that everyone dropped out (even the instructors).

We learned a lot from the details of the failure, looked more carefully at how other university instructors had redesigned traditional academic courses to make them effective for remote students who had full-time jobs and who probably hadn’t sat in an academic classroom for several years (so their academic skills were rusty). After a bunch more pilot-testing, I offered the first BBST:Foundations as a one-month class (essentially the modern structure) in October, 2007.

We offered BBST:Foundations through AST, adding BBST:Bug Advocacy in 2008, redoing BBST:Foundations (new slides, videos, etc.) in 2010, and adding BBST:Test Design in 2012.

AST was our learning lab for commercial courseware. Florida Tech’s testing courses, and my graduate research assistants at Florida Tech, were my learning lab for the academic courseware. I would try new ideas at Florida Tech and bring the ones that seemed promising into the AST courses as minor or major updates. All the while, I was publishing the courseware at my lab’s website, testingeducation.org, and encouraging other people to use the material in their courses.

We trained and supervised a crew of volunteer instructors for AST’s BBST, but other people were teaching the course (or parts of it) too. This included professors, commercial trainers, managers teaching their own staff how to test, etc. Becky created an instructor’s course (how to teach BBST), which we offered as an instructor-led course through AST but which we also offered as a free learning experience on the web (study it yourself at your own pace). In 2012, we published a 357-page Instructor’s Manual for BBST. We published the book as a Technical Report (a publication method available to university professors) so that we could supply it to the public for free.

Underlying much of the AST collaboration was a hope that we could create an open courseware community that would function like some of the successful open software communities.

  • In the open source software world, many of the volunteers who maintain and enhance open source software are able to charge people for support services. That is, the software (courseware) is free but if you want support, you have to pay for it. The support money creates an income stream that makes it possible for skilled people to spend time improving the software.
  • We hoped that we could create a similar type of structure for open source courseware (the BBST courses). You can see the thinking, for example, in a 2008 paper that Rebecca and I wrote with Scott Barber, Building a free courseware community around an online software testing curriculum.

It turns out that this is a very complex idea. It is probably too complex for a small professional society that handles most of its affairs pretty informally.

For now, Rebecca and I have formed Kaner, Fiedler & Associates to sustain BBST instead. That is, KFA sells commercial BBST training and the income stream makes it possible for us to make new-and-improved versions of BBST.

AST might also create its own project to maintain and enhance BBST. If so, we’ll probably see the evolution of contrasting designs for the next generations of the courses. We think we’d learn a lot from that dynamic and we hope that it happens.

An Excerpt from our 2007 Grant Proposal

This is from our application for NSF Award CCLI-0717613, Adaptation and Implementation of an Activity-Based Online or Hybrid Course in Software Testing. (When we acknowledge support from NSF, we are required to remind you that the National Science Foundation does not endorse any opinions, findings, conclusions or recommendations that arose out of NSF-funded research.) The full application is available online, but it is very concisely written, structured according to very specific NSF guidelines, and packed with points that address NSF-specific concerns. Here is the most relevant section of that 56-page document, in terms of explaining our approach and literature review for the course’s instructional design.

3. Our Current Course (Black Box Software Testing—BBST)

We adopted the new teaching method in Spring 2005 after pilot work in 2004. Our new approach spends precious student contact hours on active learning experiences (more projects, seminars and labs) that involve real-world problems, communication skills, critical thinking, and instructor scaffolding [129, 136] without losing the instructional benefits of polished lectures. Central to a problem-based learning environment is that students focus on “becoming a practitioner, not simply learning about practice” [122, p. 3].

Anderson et al.’s [11] update to Bloom’s taxonomy [20] is two-dimensional: Knowledge and Cognitive Process.

  • On the Knowledge dimension, the levels are Factual Knowledge (such as the definition of a software testing technique), Conceptual Knowledge (such as the theoretical model that predicts that a given test technique is useful for finding certain kinds of bugs), Procedural Knowledge (how to apply the technique), and Metacognitive Knowledge (example: the tester decides to study new techniques on realizing that the ones s/he currently knows don’t apply well to the current situation).
  • On the Cognitive Process dimension, the levels are Remembering (such as remembering the name of a software test technique that is described to you), Understanding (such as being able to describe a technique and compare it with another one), Applying (actually doing the technique), Analyzing (from a description of a case in which a test technique was used to find a bug, being able to strip away the irrelevant facts and describe what technique was used and how), Evaluating (such as determining whether a technique was applied well, and defending the answer), and Creating (such as designing a new type of test).

For most of the material in these classes, we want students to be able to explain it (conceptual knowledge, remembering, understanding), apply it (procedural knowledge, application), explain why their application is a good illustration of how this technique or method should be applied (understanding, application, evaluation), and explain why they would use this technique instead of some other (analysis).
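The two dimensions above can be treated as a grid on which each learning objective is placed. Here is a minimal, hypothetical Python sketch of that idea (the function name, objective wordings, and grid placements are ours for illustration, not part of the taxonomy itself):

```python
# The two dimensions of the revised Bloom taxonomy (Anderson et al.).
KNOWLEDGE = ["Factual", "Conceptual", "Procedural", "Metacognitive"]
COGNITIVE_PROCESS = ["Remembering", "Understanding", "Applying",
                     "Analyzing", "Evaluating", "Creating"]

def classify(objective, knowledge, process):
    """Record where a learning objective sits on the two-dimensional grid."""
    assert knowledge in KNOWLEDGE and process in COGNITIVE_PROCESS
    return {"objective": objective, "knowledge": knowledge, "process": process}

# Illustrative placements of course objectives on the grid:
objectives = [
    classify("State the definition of a test technique",
             "Factual", "Remembering"),
    classify("Compare one test technique with another",
             "Conceptual", "Understanding"),
    classify("Apply a technique to a product under test",
             "Procedural", "Applying"),
    classify("Judge whether a technique was applied well, and defend the answer",
             "Procedural", "Evaluating"),
]
```

The point of the grid is that a single topic can (and, in this course, should) appear in several cells: the same technique is taught for recall, for application, and for evaluation of one's own work.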

3.1 We organize classes around learning units that typically include:

  • Video lecture and lecture slides. Students watch lectures before coming to class. Lectures can convey the lecturer’s enthusiasm, which improves student satisfaction [158] and provide memorable examples to help students learn complex concepts, tasks, or cultural norms [47, 51, 115]. They are less effective for teaching behavioral skills, promoting higher-level thinking, or changing attitudes or values [19]. In terms of Bloom’s taxonomy [11, 20], lectures would be most appropriate for conveying factual and conceptual knowledge at the remembering and understanding levels. Our students need to learn the material at these levels, but as part of the process of learning how to analyze situations and problems, apply techniques, and evaluate their own work and the work of their peers. Stored lectures are common in distance learning programs [138]. Some students prefer live lectures [45, 121] but on average, students learn as well from video as live lecture [19, 139]. Students can replay videos [53] which can help students whose first language is not English. Web-based lecture segments supplement some computer science courses [34, 44]. Studio-taped, rehearsed lectures with synchronously presented slides (like ours) have been done before [29]. Many instructors tape live lectures, but Day and Foley [30-34] report their students prefer studio-produced lectures over recorded live lectures. We prefer studio-produced lectures because they have no unscripted interruptions and we can edit them to remove errors and digressions.
  • Application to a product under test. Each student joins an open source software project (such as Open Office or Firefox) and files work with the project (such as bug reports in the project’s bug database) that they can show and discuss during employment interviews. This helps make concepts “real” to students by situating them in the development of well-regarded products [118]. It facilitates transfer of knowledge and skills to the workplace, because students are doing the same tasks and facing the same problems they would face with commercial software [25]. As long as the assignments are not too far beyond the skill and knowledge level of the learner, authentic assignments yield positive effects on retention, motivation, and transfer [48, 52, 119, 153].
  • Classroom activities. We teach in a lab with one computer per student. Students work in groups. Activities are open book, open web. The teacher moves from group to group asking questions, giving feedback, or offering supplementary readings that relate to the direction taken by an individual group. Classroom activities vary. Students might apply ideas, practice skills, try out a test tool, explore ideas from lecture, or debate a question from the study guide. Students may present results to the class in the last 15 minutes of the 75-minute class. They often hand in work for (sympathetic) grading: we use activity grades to get attention [141] and give feedback, not for high-stakes assessment. We want students laughing together about their mistakes in activities, not mourning their grades [134].
  • Examples. These supplementary readings or videos illustrate application of a test technique to a shipping product. Worked examples can be powerful teaching tools [25], especially when motivated by real-life situations. They are fundamental for some learning styles [43]. Exemplars play an important role in the development and recollection of simple and complex concepts [23, 126, 146]. The lasting popularity of problem books, such as the Schaum’s Outline series and more complex texts like Sveshnikov [148] attests to the value of example-driven learning, at least for some learners. However, examples are not enough to carry a course. In our initial work under NSF Award EIA-0113539 ITR/SY+PE: Improving the Education of Software Testers, we expected to be able to bring testing students to mastery of some techniques through practice with a broad set of examples. Padmanabhan [113, 132] applied this to domain testing in her Master’s thesis project at Florida Tech, providing students with 18 classroom hours of instruction, including lecture, outlines of ways to solve problems, many practice exercises and exams. Students learned exactly what they were taught. They could solve new problems similar to those solved in class. However, in their final exam, we included a slightly more complicated problem that required them to apply their knowledge in a way that had been described in lecture but not specifically practiced. The students did the same things well, in almost exactly the same ways. However, they all failed to notice problems that should have been obvious to them but that only required a small stretch from their previous drills. This result was a primary motivator for us to redesign the testing course from a lecture course heavy with stories, examples and practice to more heavily emphasize more complex activities.
  • Assigned readings.
  • Assignments, which may come with grading rubrics. These are more complex tasks than in-class activities. Students typically work together over a two-week period.
  • Study guide questions. At the start of the course, we give students a list of 100 questions. All midterm and final exam questions come from this pool. We discuss use and grading of these questions in [60] and make that paper available to students. We encourage group study, especially comparison of competing drafts of answers. We even host study sessions in a café off campus (buying cappuccinos for whoever shows up). We encourage students to work through relevant questions in the guide at each new section of the class. These help self-regulated learners monitor their progress and understanding—and seek additional help as needed. They can focus their studying and appraise the depth and quality of their answers before they write a high-stakes exam. Our experience of our students is consistent with Taraban, Rynearson, & Kerr’s [149]—many students seem not to be very effective readers or studiers, nor very strategic in the way they spend their study time—as a result, they don’t do as well on exams as we believe they could. Our approach gives students time to prepare thoughtful, well-organized, peer-reviewed answers. In turn, this allows us to require thoughtful, well-organized answers on time-limited exams. This maps directly to one of our objectives (tightly focused technical writing). We can also give students complex questions that require time to carefully read and analyze, but that don’t discriminate against students whose first language is not English because these students have the questions well in advance and can seek guidance on the meaning of a question.

Excerpt from the Proposal’s references:

11. Anderson, L.W., Krathwohl, D.R., Airasian, P.W., Cruikshank, K.A., Mayer, R.A., Pintrich, P.R., Raths, J. and Wittrock, M.C. A Taxonomy for Learning, Teaching & Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives. Longman, New York, 2001.

19. Bligh, D.A. What’s the Use of Lectures? Jossey-Bass, San Francisco, 2000.

20. Bloom, B.S. (ed.), Taxonomy of Educational Objectives: Book 1 Cognitive Domain. Longman, New York, 1956.

23. Brooks, L.R. Non-analytic concept formation and memory for instances. in Rosch, E. and Lloyd, B.B. eds. Cognition and categorization, Erlbaum, Hillsdale, NJ, 1978, 169-211.

25. Clark, R.C. and Mayer, R.E. e-Learning and the Science of Instruction. Jossey-Bass/Pfeiffer, San Francisco, CA, 2003.

29. Dannenberg, R.P. Just-In-Time Lectures, undated.

30. Day, J.A. and Foley, J. Enhancing the classroom learning experience with Web lectures: A quasi-experiment GVU Technical Report GVU-05-30, 2005.

31. Day, J.A. and Foley, J., Evaluating Web Lectures: A Case Study from HCI. in CHI ’06 (Extended Abstracts on Human Factors in Computing Systems), (Quebec, Canada, 2006), ACM Press, 195-200. Retrieved January 4, 2007, from http://www3.cc.gatech.edu/grads/d/Jason.Day/documents/er703-day.pdf

32. Day, J.A., Foley, J., Groeneweg, R. and Van Der Mast, C., Enhancing the classroom learning experience with Web lectures. in International Conference of Computers in Education, (Singapore, 2005), 638-641. Retrieved January 4, 2007, from http://www3.cc.gatech.edu/grads/d/Jason.Day/documents/ICCE2005_Day_Short.pdf

33. Day, J.A., Foley, J., Groeneweg, R. and Van Der Mast, C. Enhancing the classroom learning experience with Web lectures. GVU Technical Report GVU-04-18, 2004.

34. Day, J.A. and Foley, J.D. Evaluating a web lecture intervention in a human-computer interaction course. IEEE Transactions on Education, 49 (4), 420-431. Retrieved December 31, 2006.

43. Felder, R.M. and Silverman, L.K. Learning and teaching styles in engineering education. Engineering Education, 78 (7), 674-681.

44. Fintan, C., Lecturelets: web based Java enabled lectures. in Proceedings of the 5th annual SIGCSE/SIGCUE ITiCSE Conference on Innovation and technology in computer science education, ( Helsinki, Finland, 2000), 5-8.

45. Firstman, A. A comparison of traditional and television lectures as a means of instruction in biology at a community college. ERIC, 1983.

47. Forsyth, D., R. The Professor’s Guide to Teaching: Psychological Principles and Practices. American Psychological Association, Washington, D.C., 2003.

48. Gagne, E.D., Yekovich, C.W. and Yekovich, F.R. The Cognitive Psychology of School Learning. HarperCollins, New York, 1994.

51. Hamer, L. A folkloristic approach to understanding teachers as storytellers. International Journal of Qualitative Studies in Education12 (4). 363-380, from http://ejournals.ebsco.com/direct.asp?ArticleID=NLAW20N8B16TQKHDEECM.

52. Haskell, R.E. Transfer of learning: Cognition, instruction, and reasoning. Academic Press, San Diego, 2001.

53. He, L., Gupta, A., White, S.A. and Grudin, J. Corporate Deployment of On-demand Video: Usage, Benefits, and Lessons, Microsoft Research, Redmond, WA, 1998, 12.

60. Kaner, C., Assessment in the software testing course. in Workshop on the Teaching of Software Testing (WTST), (Melbourne, FL, 2003), from https://kaner.com/pdfs/AssessmentTestingCourse.pdf

113. Kaner, C. and Padmanabhan, S., Practice and transfer of learning in the teaching of software testing. in Conference on Software Engineering Education & Training, (Dublin, 2007).

115. Kaufman, J.C. and Bristol, A.S. When Allport met Freud: Using anecdotes in the teaching of Psychology. Teaching of Psychology28 (1). 44-46.

118. Lave, J. and Wenger, E. Situated Learning: Legitimate Peripheral Participation. Cambridge University Press, Cambridge, England, 1991.

119. Lesh, R.A. and Lamon, S.J. (eds.). Assessment of authentic performance in school mathematics. AAAS Press, Washington, DC, 1992.

121. Maki, W.S. and Maki, R.H. Multimedia comprehension skill predicts differential outcomes of web-based and lecture courses.Journal of Experimental Psychology: Applied8 (2). 85-98.

122. MaKinster, J.G., Barab, S.A. and Keating, T.M. Design and implementation of an on-line professional development community: A project-based learning approach in a graduate seminar Electronic Journal of Science Education, 2001.

126. Medin, D. and Schaffer, M.M. Context theory of classification learning. Psychological Review85 (207-238)

129. National Panel Report. Greater Expectations: A New Vision for Learning as a Nation Goes to College, Association of American Colleges and Universities, Washington, D.C., 2002.

132. Padmanabhan, S. Domain Testing: Divide & Conquer Department of Computer Sciences, Florida Institute of Technology, Melbourne, FL, 2004.

134. Paris, S.G. Why learner-centered assessment is better than high-stakes testing. in Lambert, N.M. and McCombs, B.L. eds.How Students Learn: Reforming Schools Through Learner-Centered Education, American Psychological Association, Washington, DC, 1998.

136. Project Kaleidoscope. Project Kaleidoscope Report on Reports: Recommendations for Action in support of Undergraduate Science, Technology, Engineering, and Mathematics. Investing in Human Potential: Science and Engineering at the Crossroads, Washington, D.C., 2002. Retrieved January 16, 2006, from http://www.pkal.org/documents/RecommentdationsForActionInSupportOfSTEM.cfm.

138. Rossman, M.H. Successful online teaching using an asynchronous learner discussion forum. Journal of Asynchronous Learning Networks3 (2), from http://www.sloan-c.org/publications/jaln/v3n2/v3n2_rossman.asp.

139. Saba, F. Distance education theory, methodology, and epistemology: A pragmatic paradigm. in Moore, M.G. and Anderson, W.G. eds. Handbook of Distance Education, Lawrence Erlbaum Associates, Mahwah, New Jersey, 2003, 3-20.

141. Savery, J.R. and Duffy, T.M. Problem Based Learning: An Instructional Model and Its Constructivist Framework, Indiana University, Bloomington, IN, 2001.

146. Smith, D.J. Wanted: A New Psychology of Exemplars. Canadian Journal of Psychology59 (1). 47-55

148. Sveshnikov, A.A. Problems in probability theory, mathematical statistics and theory of random functions. Saunders, Philadelphia, 1968

149. Taraban, R., Rynearson, K. and Kerr, M. College students’ academic performance and self-reports of comprehension strategy use. Reading Psychology21 (4). 283-308.

153. Van Merrienboer, J.J.G. Training complex cognitive skills: A four-component instructional design model for technical training. Educational Technology Publications, Englewood Cliffs, NJ, 1997.

158. Williams, R.G. and Ware, J.E. An extended visit with Dr. Fox: Validity of student satisfaction with instruction ratings after repeated exposures to a lecturer. American Educational Research Journal14 (4). 449-457.

Thinking about the fraud against Target

January 22nd, 2014

I read an interesting article in the Wall Street Journal today: http://online.wsj.com/news/articles/SB10001424052702304027204579332990728181278

Basically, the theory presented in the article is that there are these wonderful credit/debit cards with embedded chips that are much more secure than the current system. If only Target (and other retailers) had adopted these, we would have less fraud. Apparently, the fault lies with Target.

I imagine that the expected response to this article is “What were they thinking?” as the reader realizes that more-effective technology was at hand at what might have been a reasonable price.

I got to watch some of this play out in the late 1990’s. At the time, I was working as a technology-focused lawyer and one of the areas I worked on was electronic payment systems. I published a few papers on this. One available from my website appeared in 1998 in the Journal of Electronic Commerce, called “SPLAT! Requirements bugs on the information superhighway“, see https://kaner.com/pdfs/splat.pdf

The issues I wrote about in this (and related papers) involved the use of public-key encryption systems to guarantee identity. The same commercial-liability issues were coming up for chip cards, with the same rationale.
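The identity-guarantee mechanism behind these proposals can be sketched in a few lines. This is a deliberately tiny, insecure illustration of public-key signing and verification (the key numbers, messages, and function names are hypothetical; real systems use keys of 2048+ bits with properly padded hashes):

```python
# Toy illustration of signature-based identification. Uses tiny RSA-style
# numbers for readability -- never use anything like this for real security.
import hashlib

# Hypothetical tiny key pair: n = p*q, public exponent e, private exponent d.
p, q = 61, 53
n = p * q                      # 3233
phi = (p - 1) * (q - 1)        # 3120
e = 17
d = pow(e, -1, phi)            # modular inverse of e: 2753 (Python 3.8+)

def digest(message: str) -> int:
    # Reduce a message to a number smaller than n (toy-sized hash).
    return int.from_bytes(hashlib.sha256(message.encode()).digest(), "big") % n

def sign(message: str) -> int:
    # Only the keyholder (who knows d) can produce this value.
    return pow(digest(message), d, n)

def verify(message: str, signature: int) -> bool:
    # Anyone who knows the public key (e, n) can check it.
    return pow(signature, e, n) == digest(message)

sig = sign("Pay $100 to Alice")
print(verify("Pay $100 to Alice", sig))            # True
print(verify("Pay $100 to Alice", (sig + 1) % n))  # False: tampered signature
```

The legal question in the text is exactly about what the `False` branch should mean: if a signature verifies but the key was stolen, who bears the loss?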

These systems offered the potential of significantly reducing fraud in consumer transactions. Fraud was seen as a big problem. With these savings of billions of dollars of losses, some credit card company representatives spoke of being able to noticeably lower their fees and interest rates. Who wouldn’t want that?

Unfortunately, some financial services firms (and some other folks) saw two opportunities here.

  1. They hated paying money to criminals committing fraud
  2. They hated guaranteeing every credit card transaction in the event of fraud—they wanted to put this risk back on the consumer but current legislation wouldn’t let them

The proposals to adopt encryption-based identification systems in commerce tied these together. The proposed laws would:

  1. authorize the use of encryption-based identification as equivalent to an ink signature
  2. treat the encryption-based identification as absolutely authoritative, so that if someone successfully impersonated you, you would bear all the loss. Current law sticks the financial-services firms with the risk of credit-card fraud losses because they design the system and decide how much security to build into it. The proposed new system would be an alternative to the consumer-protected credit-card system. It would flip the risk to the consumer.

I think legislation would have easily passed that provided incentives to adopt encryption-based identification. For example, the legislation could have created a “rebuttable presumption” — an instruction to a court to assume that a message encrypted with your key came from you and if you wanted to deny that, you would have to prove it.  This legislation would have reduced fraud, which would benefit everyone. (Well, everyone but the criminals…)

Unfortunately, the advocates' demands went further. Even if you could prove that you were the victim of identity theft that was in no way your fault, you would still be held accountable for the loss.

The lawyers advocating for incentivizing encryption-based identification weren’t willing to separate the proposals. The result of their inflexibility was opposition to encryption-based payment-related identification systems (including chip cards). One dimension of the opposition was technical–the security of the payment systems was almost certainly less (and therefore the risk of fraud that was created by the system and not by negligence of the consumer was greater) than the most enthusiastic proponents imagined. Another dimension was irritation with what was perceived as greed and unwillingness to compromise.

Back then, I saw this play out because I was helping a committee write the Uniform Electronic Transactions Act (UETA). This eventually passed in most states and was then federalized under the name ESIGN. ESIGN now governs electronic payments in the United States. The multi-year drafting process that yielded UETA/ESIGN offered a unique opportunity to write incentives for stronger identification systems into the laws governing electronic payments. Instead, we chose to write legislation that accepted a status quo that involved too much fraud, with prospects of much worse fraud to come. I was one of the people who successfully encouraged the UETA drafting committee to take this less-secure route because there was no politically-feasible path to what seemed like the obvious compromise.

Our economy has benefited enormously from legislation that lets you buy something by clicking “I agree”, without having to sign a physical piece of paper with a physical ink-pen. We could have done this better. Instead, we accepted the predictable future outcome that the United States would continue to use insecure payment systems, that would result in ongoing fraud, like the latest attacks on Target, Neiman Marcus, and (apparently, according to recent reports) at least six other national retailers.

On the design of advanced courses in software testing

January 19th, 2014

This year’s Workshop on Teaching Software Testing (WTST 2014) is on teaching advanced courses in software testing. During the workshop, I expect we will compare notes on how we design/evaluate advanced courses in testing and how we recognize people who have completed advanced training.

This post is an overview of one of the two presentations I am planning for WTST.

This presentation will consider the design of the courses. The actual presentation will rely heavily on examples, mainly from BBST (Foundations, Bug Advocacy, Test Design), from our new Domain Testing course, and from some of my non-testing courses, especially statistics and metrics. The slides that go with these notes will appear at the WTST site in late January or early February.

In the education community, a discussion like this would come as part of a discussion of curriculum design. That discussion would look more broadly at the context of the curriculum decisions, often considering several historical, political, socioeconomic, and psychological issues. My discussion is more narrowly focused on the selection of materials, assessment methods and teaching-style tradeoffs in a specialized course in a technical field. The broader issues come into play, but I find it more personally useful to think along six narrower dimensions:

  • content
  • institutional considerations
  • skill development
  • instructional style
  • expectations of student performance
  • credentialing

Content

In terms of the definition of “advanced”, I think the primary agreement in the instructional community is that there is no agreement about the substance of advanced courses. A course can be called advanced if it builds on other courses. Under this institutional definition, the ordering of topics and skills (introductory to advanced) determines what is advanced, but that ordering is often determined by preference or politics rather than by principle.

I am NOT just talking here about fields whose curricula involve a lot of controversy. Let me give an example. I am currently teaching Applied Statistics (Computer Science 2410). This is parallel in prerequisites and difficulty to the Math department’s course on Mathematical Statistics (MTH 2400). When I started teaching this, I made several assumptions about what my students would know, based on years of experience with the (1970’s to 1990’s) Canadian curriculum. I assumed incorrectly that students would learn very early about the axioms underlying algebra—this was often taught as Math 100 (1st course in the university curriculum). Here, it seems common to find that material in 3rd year. I also assumed incorrectly that my students would be very experienced in the basics of proving theorems. Again mistaken: to my shock, many CS students will graduate, having taken several required math courses, with minimal skills in formal logic or theorem proving. I’m uncomfortable with these choices (in the “somebody moved my cheese” sense of the word “uncomfortable”)—it doesn’t feel right, but I am confident that these students studied other topics instead, topics that I would consider 3rd-year or 4th-year. Even in math, curriculum design is fluid and topics that some of us consider foundational, others consider advanced.

In a field like ours (testing) that is far more encumbered with controversy, there is a strong argument for humility when discussing what is “foundational” and what is “advanced”.

Institutional Considerations

In my experience, one of the challenges in teaching advanced topics is that many students will sign up who lack basic knowledge and skills, or who expect to use this course as an opportunity to relitigate what they learned in their basic course(s). This is a problem in commercial and university courses, but in my experience, it is much easier to manage in a university because of the strength and visibility of the institutional support.

To make space for advanced courses, institutions that designate a course as advanced are likely to

  • state and enforce prerequisites (courses that must be taken, or knowledge/skill that must be demonstrated before the student can enrol in the advanced course)
  • accept transfer credit (a course can be designated as equivalent to one of the institution’s courses and serve as a prerequisite for the advanced course)

The designation sets expectations. Typically, this gives instructors room to:

  1. limit class time spent rehashing foundational material
  2. address topics that go beyond the foundational material (whatever material this institution has designated as foundational)
  3. tell students who do not know the foundational material (or who cannot apply it to the content of the advanced course) that it is their responsibility to catch up to the rest of the class, not the course’s responsibility to slow down for them
  4. demand an increased level of individual performance from the students (not just work products on harder topics, but better work products that the student produces with less handholding from the instructor)

Note clearly that in an institution like a university, the decisions about what is foundational, what is advanced, and what prerequisites are required for a particular course are made by groups of instructors, not by the administrators of the institution. This is an idealized model–it is common for institutional administrators to push back, encouraging instructors to minimize the number of prerequisites they demand for any particular course and encouraging instructors to take a broader view of equivalence when evaluating transfer credits. But at its core, the administration adopts structures that support the four benefits that I listed above (and probably others). I think this is the essence of what we mean by “protecting the standards” of the institution.

Skill Development

I think of a skill as a type of knowledge that you can apply (you use it, rather than describe it) and your application (your performance) improves with deliberate practice.

Students don’t just learn content in courses. They learn how to learn, how to investigate and find/create new ideas or knowledge on their own, how to find and understand the technical material of their field, how to critically evaluate ideas and data, how to communicate what they know, how to work with other students, and so on. Every course teaches some of these to some degree. Some courses are focused on these learning skills.

Competent performance in a professional field involves skills that go beyond the learning skills. For example, skills we must often apply in software testing include:

  • many test design techniques (domain testing, specification-based testing, etc.). Testers get better with these through a combination of theoretical instruction, practice, and critical feedback
  • many operational tasks (setting up test systems, running tests, noting what happened)
  • many advanced communication skills (writing that combines technical, persuasive and financial considerations)

Taxonomies like Bloom’s make the distinction between memorizable knowledge and application (which I’m describing as skill here). Some courses, and some exams, are primarily about memorizable knowledge and some are primarily about application.

In general, in my own teaching, I think of courses that focus on memorizable knowledge as survey courses (broad and shallow). I think of survey courses as foundational rather than advanced.

Most survey courses involve some application. The student learns to apply some of the content. In many cases, the student can’t understand the content without learning to apply it at least to simple cases. (In our field, I think domain testing–boundary and equivalence class analysis–is like this.) It seems to me that courses run on a continuum: how much emphasis falls on learning things you can remember and describe versus learning ways to apply knowledge more effectively. I think of a course that is primarily a survey course as a survey course, even if it includes some application.
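The boundary and equivalence class analysis mentioned above can be shown with a deliberately simple, hypothetical example (the age field and its 18–65 rule are invented for illustration; this is the kind of simple case a survey course might have students work through):

```python
# A minimal domain-testing sketch: partition the input domain of one field
# into equivalence classes, then test the boundaries of each class, where
# off-by-one bugs tend to cluster.

def accept_age(age: int) -> bool:
    # Hypothetical spec: valid ages are 18 through 65, inclusive.
    return 18 <= age <= 65

# Equivalence classes: too low (<18), valid (18..65), too high (>65).
# Boundary values: the edge of each class plus its nearest neighbors.
cases = [
    (17, False),  # just below the lower boundary
    (18, True),   # lower boundary
    (19, True),   # just above the lower boundary
    (64, True),   # just below the upper boundary
    (65, True),   # upper boundary
    (66, False),  # just above the upper boundary
]

for value, expected in cases:
    assert accept_age(value) == expected, f"failed at {value}"
print("all boundary cases pass")
```

Even a student who has only memorized the definitions cannot pick those six values without actually applying the technique, which is the sense in which survey courses still involve some application.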

Instructional Style

Lecture courses are probably the easiest to design and the easiest to sell. Commercial and university students seem to prefer courses that involve a high proportion of live lecture.

Lectures are effective for introducing students to a field. They introduce vocabulary (not that students remember much of it–they forget most of what they learn in lecture). They convey attitudes and introduce students to the culture of the field. They can give students the sense that this material is approachable and worth studying. And they entertain.

Lectures are poor vehicles for application of the material (there’s little space for students to try things out, get feedback and try them again).

In my experience, they are usually also poor vehicles for critical thinking (evaluating the material). Some lecturers develop a style that demands critical thinking from the students (think of law schools) but I think this requires very strong cultural support. Students understand, in law school, that they will flunk out if they come to class unprepared and are unwilling or unable to present and defend ideas quickly, in response to questions that might come from a professor at any time. Lawyers view the ability to analyze, articulate and defend in real time as a core skill in their field and so this approach to teaching is considered appropriate. In other fields that don’t prioritize oral argumentation so highly, a professor who relied on this teaching style and demanded high performance from every student would be treated as unusual and perhaps inappropriate.

As students progress from basic to advanced, the core experiences they need to support further progress also change, from lecture to activities that require them to do more–more applications to increasingly complex tasks, more critical evaluation of what they are doing, what others are doing, and what they are being told to do or to accept as correct or wise. Fewer things are correct. More are better-for-these-reasons or better-for-these-purposes.

Expectations of Student Performance

More advanced courses demand that students take more responsibility for the quality of their work:

  • The students expect, and tolerate, less specific instructions. If they don’t understand the instructions, the students understand that it is their responsibility to ask for clarification or to do other research to fill in the blanks.
  • The students don’t expect (or know they are not likely to get) worked examples that they can model their answers from or rubrics (step-by-step evaluation guides) that they can use to structure their answers. These are examples of scaffolding, instructional support structures to help junior students accomplish new things. They are like the training wheels on bicycles. Eventually, students have to learn to ride without them. Not just how to ride down this street for these three blocks, but how to ride anywhere without them. Losing the scaffolding is painful for many students and some students protest emphatically that it is unfair to take these away. I think the trend in many universities has been to provide more scaffolding for longer. This cuts back on student appeals and seems to please accreditors (university evaluators) but I think this delays students’ maturation in their field (and generally in their education).

One of the puzzles of commercial instruction is how to assess student performance. We often think of assessment in terms of passing or failing a course. However, assessment is more broadly important, for giving a student feedback on how well she knows the material or how well she does a task. There has been so much emphasis on high-stakes assessment (you pass or you fail) in academic instruction that many students don’t understand the concept of formative assessment (assessment primarily done to give the student feedback in order to help the student learn). This is a big issue in university instruction too, but my experience is that commercial students are more likely to be upset and offended when they are given tough tasks and told they didn’t perform well on them. My experience is that they will make more vehement demands for training wheels in the name of fairness, without being willing to accept the idea that they will learn more from harder and less-well-specified tasks.

Things are not so well specified at work. More advanced instruction prepares students more effectively for the uncertainties and demands of real life. I believe that preparation involves putting students into uncertain and demanding situations, helping them accept this as normal, and helping them learn to cope with situations like these more effectively.

Credentialing

Several groups offer credentials in our field. I wrote recently about credentialing in software testing at https://kaner.com/?p=317. My thoughts on that will come in a separate note to WTST participants, and a separate presentation.

Last call for WTST 2014

November 24th, 2013

This year’s Workshop on Teaching Software Testing is focused on designing and teaching advanced courses in software testing. It is in sunny Florida, in late January 2014. Right after WTST, we will teach a 5-day pilot of the Domain Testing course. You can apply to attend either one.

We expect the WTST discussion to flow down two paths. At this point, we are not sure which will dominate:

1. What are the characteristics of a genuinely “advanced” testing course?

What are people teaching or designing at this level and what design decisions and assessment decisions are they making? What courses should we be designing?

2. What should the characteristics be for an advanced certification in software testing?

I’ve been criticizing the low bar set by ISTQB’s, QAI’s, and ASQ’s certifications for over 15 years. From about 1996 to (maybe it was) 2003, I worked with several colleagues on ideas for a better certification. As I pointed out recently, those ideas failed. We couldn’t find a cost-effective solution that met our quality standards. I moved on to other challenges, such as creating the BBST series. Some others adopted a more critical posture toward certification in general.

Looking back, I think the same problems that motivated thousands of testers (and employers) to seek a credentialing system for software testers are still with us. The question, I think, is not whether we need a good credentialing system. The question is whether we can get a credentialing system that is good.

From some discussions about advanced course design, I think we are likely to see a discussion of advanced credentialing at WTST. The idea that ties this discussion to WTST is that the credential would be based at least partially on performance in advanced courses.

I don’t know whether this discussion will go very far, whether it will be a big part of the meeting itself or just the after-meeting dinners, or whether anyone will come to any agreements. But if you are interested in participating in a constructive discussion about a very hard problem, this might be a good meeting.

To apply to come to WTST, please send me a note (kaner@cs.fit.edu).

For more information about WTST, see http://wtst.org/. For more on the first pilot teaching of the Domain Testing course, which we will teach immediately following WTST, see http://bbst.info.

The “Failure” of Udacity

November 23rd, 2013

If you are not aware of it, Udacity is a huge provider of a type of online courses called MOOCs (Massive Open Online Courses). Recently, a founder of Udacity announced that he was disappointed in Udacity’s educational results and was shifting gears from general education to corporate training.

I was brought into some discussions of this among academics and students. A friend suggested that I slightly revise one of my emails for general readership on my blog. So here goes.

My note is specifically a reaction to two articles:

Udacity offers free or cheap courses. My understanding is that it has a completion rate of 10% (only 10% of the students who start, finish) and a pass rate of 5%. This is not a surprising number. Before there were MOOCs, numbers like this were reported for other types of online education in which students set their own pace or worked with little direct interaction with the instructor. For example, I heard that Open University (a school for which I have a lot of respect) had numbers like this.

I am not sure that 10% (or 5%) is a bad rate. If the result is that thousands of people get opportunities that they would otherwise not have had, that’s an important benefit—even if only 5% find the time to make full use of those opportunities.

In general, I’m a fan of open education. When I interviewed for a professorship at Florida Tech in 1999, I presented my goal of creating open courseware for software testing (and software engineering education generally). NSF funded this in 2001. The result has been the BBST course series, used around the world in commercial and academic courses.

Software testing is a great example of the need for courses and courseware that don’t fit within the traditional university stream. I don’t believe that we will see good undergraduate degree programs in software testing. Instead, advanced testing-specific education will come from training companies and professional societies, perhaps under the supervision/guidance of some nonprofits formed for this purpose, either in universities (like my group, the Center for Software Testing Education & Research) or in the commercial space (like ISTQB). As I wrote in a recent post, I believe we have to develop a better credentialing system for software testing.

We are going to talk about this in the Workshop on Teaching Software Testing (WTST 13, January 24-26, 2014). The workshop is focused on Teaching Advanced Courses in Software Testing. It seems clear from preparatory discussions that this topic will be a springboard for discussions of advanced credentials.

Back to the MOOCs.

Udacity (and others) have earned some ill-will in the instructional community. There have been several types of irritants, such as:

  • Some advocates of MOOCs have pushed the idea that MOOCs will eliminate most teaching positions. After all, if you can get a course from one of the world’s best teachers, why settle for second best? The problem with this is that it assumes that teaching = lectures. For most students, this is not true. Students learn by doing things and getting feedback. By writing essays and getting feedback. By writing code and getting feedback. By designing tests and getting feedback. The student activities—running them, coaching students through them, critiquing student work, suggesting follow-up activities for individuals to try next—do not easily scale. I spent about 15 hours this week in face-to-face meetings with individual students, coaching them on statistical analysis or software testing. Next week I will spend about 15 hours in face-to-face meetings with local students or Skype sessions with online students. This is hard work for me, but my students tell me they learn a lot from this. When people dismiss the enormous effort that good teachers spend creating and supporting feedback loops for their students—especially people who stand to make money by convincing customers and investors that this work is irrelevant—those teachers sometimes get annoyed.
  • Some advocates of MOOCs, and several politicians and news columnists, have pushed the idea that this type of education can replace university education. After all, if you can educate a million students at the same time (with one of the world’s best teachers, no less), why bother going to a brick-and-mortar institution? It is this argument that fails when 95% of the students flunk out or drop out. But I think it fails worse when you consider what these students are learning. How hard are the tests they are taking or the assignments they are submitting? How carefully graded is the work—not just how accurate is the grading, though that can certainly be a big issue with computerized grading—but also, how informative is the feedback from grading? Students pay attention to what you tell them about their work. They learn a lot from that, if you give them something to learn from. My impression is that many of the tests/exams are superficial and that much of the feedback is limited and mechanical. When university teachers give this quality of feedback, students complain. They know they should get better than that at school.
  • Proponents of MOOCs typically ignore or dismiss the social nature of education. Students learn a lot from each other. Back when I paid attention to the instructional-research literature, I used to read studies that reported graduating students saying they learned more from each other than from the professors. There are discussion forums in many (most? all?) MOOCs, but from what I’ve seen and been told by others, these are rarely or never well moderated. A skilled instructor keeps forum discussions on track, moves off-topic posts to another forum, asks questions, challenges weak answers, suggests readings and follow-up activities. I haven’t seen or heard of that in the MOOCs.

As far as I can tell, in the typical MOOC course, students get lectures that may have been fantastically expensive to create, but they get little engagement in the course beyond the lectures. They are getting essentially-unsupervised online instruction. And that “instruction” seems to be a technologically-fancier way of reading a book. A fixed set of material flows from the source (the book or the video lecture) to the student. There are cheaper, simpler, and faster ways to read a book.

My original vision for the BBST series was much like this. But by 2006, I had abandoned the idea of essentially-unsupervised online instruction and started working on the next generation of BBST, which would require much more teacher-with-student and student-with-student engagement.

There has been relentless (and well-funded) hype and political pressure to drive universities to offer credit for courses completed on Udacity and platforms like it. Some schools have succumbed to the pressure.

The political pressure on universities that arises from this model has been to push us to lower standards:

  • lower standards of interaction (students can be nameless cattle, herded into courses where no one will pay attention to them)
  • lower standards of knowledge expectation (trivial, superficial machine grading of the kind that can scale to a mass audience)
  • lower standards of instructional design (good design starts from considering what students should learn and how to shepherd them through experiences that will help them achieve those learning objectives. Lecture plans are not instructional design, even if the lectures are well-funded, entertaining and glitzy.)

Online instruction doesn’t have to be simplistic. But when all that the public sees in the press is well-funded hype that pushes technoglitz over instructional quality, and when that hype is repeated uncritically as if it were news, people judge the whole medium by it.

The face-to-face model of instruction doesn’t scale well enough to meet America’s (or the world’s) socioeconomic needs. We need new models. I believe that online instruction has the potential to be the platform on which we can develop the new models. But the commoditizing of the instructor and the cattle-herding of the students that have been offered by the likes of Udacity are almost certainly not the answer.

Quality – which I measure by how much students learn – costs money. Personal interaction between students and instructors, and significant assignments that get careful grading and detailed feedback, cost money. It is easy to hire cheap assistants or unqualified adjuncts, but it takes more than a warm body to provide high-quality feedback. (There are qualified adjuncts, but the law of supply and demand has an effect when adjunct pay is low.)

The real cost of education is not the money. Yes, that is hugely significant. But it is as nothing compared to the years of life that students sacrifice to get an education. The cost of time wasted is irrecoverable.

In the academic world, there are some excellent online courses and there has been a lot of research on instructional effectiveness in these courses. Many online courses are more effective—students learn more—than face-to-face courses that cover the same material. But these are also more intense, for the teacher and the students. The students, and their teachers, work harder.

Becky Fiedler and I formed Kaner Fiedler Associates to support the next generation of BBST courses. We started the BBST effort with a MOOC-like vision of a structure that offers something for almost nothing. Our understanding evolved as we created generations of open courseware.

I think we can create high-quality online education that costs less than traditional schooling. I think we can improve the ways institutions recognize students’ preexisting knowledge, reducing the cost (but not the quality) of credentials. But cost reduction and value improvement do not mean “free” or even “cheap.” The price has to be high enough to sustain the course development, the course maintenance, and the costs of training, providing and supervising good instructors. There is, as far as we can tell, no good substitute for this.

New Book: The Domain Testing Workbook

November 16th, 2013

Sowmya Padmanabhan, Doug Hoffman and I just published a new book together, The Domain Testing Workbook.

The book focuses on one (1) testing technique. Domain Testing is the name of a generalized approach to Equivalence Class Analysis and Boundary Testing. Our goal is to help you develop skill with this one technique. There are plenty of overviews of this technique: it is the most widely taught, and probably the best understood, technique in the field. However, we’re not aware of any other presentations that are focused on helping you go beyond an understanding of the technique to achieving the ability to use it competently.
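For readers who haven’t met the technique, here is a tiny, hypothetical sketch (the field name and valid range are invented for illustration, not taken from the book) of the boundary-focused test selection that equivalence class analysis builds on:

```python
# Hypothetical example: a "quantity" field specified to accept integers 1..99.
# Domain testing partitions the inputs into equivalence classes and picks the
# strongest representative of each class: values at and just past each boundary.

VALID_MIN, VALID_MAX = 1, 99

def accepts_quantity(value):
    """Toy stand-in for the program under test: True iff the value is in range."""
    return isinstance(value, int) and VALID_MIN <= value <= VALID_MAX

# One strong test per class: each boundary and its nearest invalid neighbor.
boundary_tests = [
    (VALID_MIN - 1, False),  # 0: just below the lower boundary
    (VALID_MIN,     True),   # 1: on the lower boundary
    (VALID_MAX,     True),   # 99: on the upper boundary
    (VALID_MAX + 1, False),  # 100: just above the upper boundary
]

for value, expected in boundary_tests:
    assert accepts_quantity(value) == expected
```

A real domain analysis goes far beyond a sketch like this, but a boundary table of this kind is the kernel that most presentations of the technique start from.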

This is the first of a series of books that are coming along slowly. Probably the next one will be on scenario testing, with Dr. Rebecca Fiedler, merging her deep knowledge of qualitative methodology with my hands-on use of scenarios in software design, software testing, and software-related human factors. Also in the works are books on risk-based testing and specification-based testing. We’ve been working on all of these for years. We learned much from the Domain Testing Workbook about how challenging it is to write a book with a development-of-skill instructional design goal. At this point, we have no idea what the schedule is for the next three. When the soup is ready, we’ll serve it.

This work comes out of a research proposal that I sent to the United States’ National Science Foundation (NSF) in 2001 (Improving the Education of Software Testers), which had, among its objectives:

  • “Identify skills involved in software testing.”
  • “Identify types of exercises that support the development of specific testing skills.”
  • “Create and publish a collection of reusable materials for exercises and tests.”
  • “Prototype a web-based course on software testing.”
  • “Create a series of workshops that focus on the teaching of software testing”

NSF approved the project, which made it possible for us to open the Center for Software Testing Education & Research (CSTER). The web-based course we had in mind is now available as the BBST series. The Workshops on Teaching Software Testing are now in their 13th year. And the Domain Testing Workbook is our primary contribution focused on “exercises that support the development of specific testing skills.”

When we started this project, we knew that domain testing would be the easiest technique to write this kind of book about. Over the years, we had two surprises that caused us to fundamentally rethink the structure of this book (and of other books that I’m still working on that are focused on scenario testing, risk-based testing, and specification-based testing):

  • Our first surprise (or better put, our first shock) was that we could teach students a broad collection of examples of the application of domain testing, confirm that they could apply what they had learned to similar problems, and yet these students would fail miserably when we gave them a slightly more complex problem that was a little different from anything they had seen before. This was the core finding of Sowmya’s M.Sc. thesis. What we had tripped over was the transfer problem, which is probably the central instructional-design problem in Science, Technology, Engineering & Mathematics (STEM) instruction. The classic example is the student who did well in their calculus course but cannot figure out how to apply calculus to problems (like modeling acceleration) in their introductory physics course. These students can’t transfer what they learned in one course to another course or to more complex situations (such as using it at their job). We had believed that we could work around the transfer problem by building our instruction around realistic exercises/examples. We were mistaken.
    • Ultimately, we concluded that teaching through exercises is still our best shot at helping people develop skill, but that we needed to provide a conceptual structure for the exercises that could give students a strategy for approaching new problems. We created a schema—an 18-step cognitive structure that describes how we do a domain analysis—and we present every exercise and every example in the context of that schema. We haven’t done the formal, experimental research to check this that Sowmya was able to do with our initial approach—an experiment like that is time-consuming and expensive and our funding for that type of work ran out long ago. However, we have inflicted many drafts of the schema on our students at Florida Tech and we believe it improves their performance and organizes their approach to tasks like exams.
  • Our next surprise was that domain testing is harder to apply than we expected. Doug and I are experienced with this technique. We think we’re pretty good at it, and we’ve thought that for a long time. Many other testers perceive us that way too. For example, Testing Computer Software opens with an example of domain testing and talks about the technique in more detail throughout the book. That presentation has been well received. So, when we decided to write a book with a few dozen examples that increased in complexity, we were confident that we could work through the examples quickly. It might take more time to write our analysis in a way that readers could understand, but doing the analysis would be straightforward. We were sooooooo wrong. Yes, we could quickly get to the core of each problem, working out how we would identify equivalence classes and boundaries (or non-boundary most-powerful test cases) for each class. Yes, we could quickly list several good tests. We got far enough along that we could impress other people with what we knew and how quickly we could get there, but not far enough to complete the problem. We were often getting through about a third of the analysis before getting stuck. Doug would fly to Palm Bay (Florida, where I live) and we would work problems. Some problems that we expected to take a day to solve and explain took us a week.
    • As we came to understand what skills and knowledge we were actually applying when we slogged through the harder problems, we added more description of our background knowledge to the book. A presentation of our Domain Testing Schema—and the thinking behind it—grew from an expected length of about 30 pages to 200. Our presentation of 30 worked examples grew from an expected length of maybe 120 pages (4 pages each, with diagrams) to 190 pages.

We got a lot of help with the book. Our acknowledgments list 91 people who helped us think through the ideas in the book. Many others helped indirectly, such as many participants in the WTST workshops who taught us critical lessons about the instructional issues we were beating our heads against.

Perhaps the main advance in the clarity of the presentation came out of gentle-but-firm, collegial prodding by Paul Gerrard. Paul’s point was that domain testing is really a way for a tester to model the software. The tests that come out of the analysis are not necessarily tied to the actual code. They are tied to the tester’s mental model of the code. Our first reactions to these comments were that Paul was saying something obvious. But over time we came to understand his point—it might be obvious to us, but presentations of domain testing, including ours, were doing an inadequate job of making it obvious to our readers. This led us to import the idea of a notional variable from financial analysis as a way of describing the type of model that domain testing was leading us to make. We wrote in the book:

“A notional variable reflects the tester’s “notion” of what is in the code. The tester analyzes the program as if this variable was part of the code…. The notional variable is part of a model, created by the tester, to describe how the program behaves or should behave. The model won’t be perfect. That exact variable is probably not in the code. However, the notional variable is probably functionally equivalent to (works the same way as) a variable that is in the code or functionally equivalent to a group of variables that interact to produce the same behavior as this variable.”

This in turn helped us rethink parts of the Schema in ways that improved its clarity. And it helped us clarify our thinking about domain-related exploratory testing as a way of refining the modeling of the notional variables (bringing them into increasingly accurate alignment with the program implementation or, if the implementation is inadequate, with how the program should behave).

I hope you have a chance to read The Domain Testing Workbook and if you do, that you’ll post your reactions here or as a review on Amazon.

The 2014 Workshop on Teaching Software Testing is Expanded

October 28th, 2013

The 2014 Workshop on Teaching Software Testing is scheduled for January 24-26 in Melbourne, FL. This year’s focus will be on advanced courses in software testing. You can read the details at http://wtst.org.

In addition, we will immediately follow WTST with the first live pilot of the Domain Testing class. This will be our first of the next generation of BBST classes. Rebecca Fiedler and I have been working on a new course design and we plan to use some new technology. Our first pilot will run in Melbourne from Jan 27-31, 2014. Our second live pilot is scheduled for May 12-16 at DeveloperTown in Indianapolis.

Are you interested in attending one of the pilot courses? If so, please apply here. If you are accepted as a Domain Testing Workshop participant, there will be a non-refundable deposit of $100 to defray the cost of meals, snacks, and coffee service. We’ll publish more details about the pilot courses as they become available.


Teaching a New Statistics Course: I need your recommendations

June 20th, 2013

Starting this Fall, I will be teaching Applied Statistics to undergraduate students majoring in Computer Science and in Software Engineering. I am designing the course around real-life examples (published papers, conference talks, blogged case studies, etc.) of the use of probability and statistics in our field. I need examples.

I prefer examples that show an appropriate use of a model or a statistic, but I will be glad to also teach from a few papers that use a model or statistic in a clearly invalid way. If you send me a paper, or a link to one, I would appreciate it if you would tell me what you think is good about it (for a stats student to study) and why you think that.

Background of the Students

The typical student in the course has studied discrete math, including some combinatorics, and has at least two semesters of calculus. Only some students will have multivariable calculus.

By this point in their studies, most of the students have taken courses in psychology, logic, programming (up to at least Algorithms & Data Structures), two laboratory courses in an experimental science (chemistry, physics, or biology), and some humanities courses, including a course on research sources/systems (how to navigate the research literature).

My Approach to the Course

In one semester, applied stats courses often cover the concept of probability, discrete distributions, continuous distributions, descriptive statistics, basic inferential statistics and an introduction to stochastic models. The treatment of these topics is often primarily theoretical, with close attention to the underlying theorems and to derivation of the attributes of the distributions and their key statistics. Here are three examples of frequently-used texts for these courses. I chose these because you can see their table of contents on the Amazon site.

I’d rather teach a more applied course. Rather than working through the topics in a clear linear order, I’d like to start with a list of 25 examples (case studies) and teach enough theory and enough computation to understand each case.

For example, consider a technical support department receiving customer calls. How many staff do they need? How long will customers have to wait before their call is answered? To answer questions like these, students would learn about the Erlang distributions. I’d want them to learn some of the underlying mathematics of the distribution, the underlying model (see this description of the model for example) and how this practical model connects to the mathematical model, and gain some experience with the distribution via simulation.
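To make the staffing case concrete, here is my own illustrative sketch (the function names and the numbers below are invented for this post) of the classic Erlang B recurrence and the Erlang C formula for an M/M/c queue, which answer exactly the questions posed above:

```python
from math import inf

def erlang_b(servers, offered_load):
    """Erlang B blocking probability, via the numerically stable recurrence:
    B(0) = 1;  B(k) = a*B(k-1) / (k + a*B(k-1))."""
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    return b

def erlang_c(servers, offered_load):
    """Erlang C: probability that an arriving call must wait in queue."""
    if offered_load >= servers:
        return 1.0  # unstable system: the queue grows without bound
    b = erlang_b(servers, offered_load)
    return servers * b / (servers - offered_load * (1 - b))

def mean_wait(servers, arrival_rate, mean_call_minutes):
    """Average wait (minutes) for an M/M/c queue.
    offered_load a = arrival_rate * mean service time, in erlangs."""
    a = arrival_rate * mean_call_minutes
    if a >= servers:
        return inf
    return erlang_c(servers, a) * mean_call_minutes / (servers - a)
```

For instance, with calls arriving at 2 per minute and averaging 4 minutes each (8 erlangs of offered load), students can compare the predicted waits for 9, 10, or 12 agents, and then check the formula’s predictions against a discrete-event simulation.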

Examples Wanted

The biggest complaint that students (at Florida Tech and at several other schools, long ago) have voiced to me about statistics is that they don’t understand how the math applies to anything that they care about now or will ever care about. They don’t see the application to their field or to their personal work. Because of that, I have a strong preference for examples from Computer Science / Software Engineering / Information Security / Computer Information Systems.

Several of my students will also relate well to biostatistics examples. Some will work well with quantitative finance.

In the ideal course (a few years from now), a section of the class that focuses on a specific statistic, a specific model, or a specific type of problem will have links to several papers that cover essentially the same concepts but in different disciplines. Once I have a computing-related primary example of a topic, I can flesh it out with related examples from other fields.

Please send suggestions either as comments on this blog or by email to me. A typical suggestion would include

  • a link to a paper, a presentation, a blog post or some other source that I can reach and read.
  • comments from you identifying what part of the material I should teach
  • comments from you explaining why this is a noteworthy example
  • any other notes on how it might be taught, why it is important, etc.

Thanks for your help!


Credentialing in Software Testing: Elaborating on my STPCon Keynote

May 9th, 2013

A couple of weeks ago, I talked about the state of software testing education (and software testing certification) in the keynote panel at STPCon. My comments on high-volume test automation and qualitative methods were more widely noticed, but I think the educational issues are more significant.

Here is a summary:

  1. The North American educational systems are in a state of transition.
  2. We might see a decoupling of formal instruction from credentialing.
  3. We are likely to see a dispersion of credentialing—more organizations will issue more diverse credentials.
  4. Industrial credentials are likely to play a more significant role in the American economy (and probably have an increased or continued-high influence in many other places).

If these four predictions are accurate, then we have thinking to do about the kinds of credentialing available to software testers.

Transition

For much of the American population, the traditional university model is financially unsustainable. We are on the verge of a national credit crisis because of the immensity of student loan debt.

As a society, we are experimenting with a diverse set of instructional systems, including:

  • MOOCs (massive open online courses)
  • Traditionally-structured online courses with an enormous diversity of standards
  • Low-cost face-to-face courses (e.g. community colleges)
  • Industrial courses that are accepted for university credit
  • Traditional face-to-face courses

Across these, we see the full range from easy to hard, from no engagement with the instructor to intense personal engagement, from little student activity and little meaningful feedback to lots of both. There is huge diversity of standards between course structures and institutions and significant diversity within institutions.

  • Many courses are essentially self-study. Students learn from a book or a lecturer but they get no significant assignments, feedback or assessments. Many people can learn some topics this way. Some people can learn many topics this way. For most people, this isn’t a complete solution, but it could be a partial one.
  • Some of my students prosper most when I give them free rein, friendly feedback and low risk. In an environment that is supportive, provides personalized feedback by a human, but is not demanding, some students will take advantage of the flexibility by doing nothing, some students will get lost, and some students will do their best work.
  • The students who don’t do well in a low-demand situation often do better in a higher-demand course, and in my experience, many students need both—flexibility in fields that capture their imagination and structure/demand in fields that are less engrossing or a little farther beyond the student’s current knowledge/ability than she can comfortably stretch to.

There is increasing (enormous) political pressure to allow students to take really-inexpensive MOOCs and get course credit for these at more expensive universities. More generally, there is increasing pressure to allow students to transfer courses across institutions. Most universities allow students to transfer in a few courses, but they impose limits in order to ensure that they transfer their culture to their students and to protect their standards. However, I suspect strongly that the traditional limits are about to collapse. The traditional model is financially unsustainable and so, somewhere, somehow, it has to crack. We will see a few reputable universities pressured (or legislated) into accepting many more credits. Once a few do it, others will follow.

In a situation like this, schools will have to find some other way to preserve their standards—their reputations, and thus the value of their degree for their graduates.

Seems likely to me that some schools will start offering degrees based on students’ performance on exit exams.

  • A high-standards institution might give a long and complex set of exams. Imagine paying $15,000 to take the exam series (and get grades and feedback) and another $15,000 if you pass, to get the degree.
  • At the other extreme, an institution might offer a suite of multiple-guess exams that can be machine-graded at a much lower cost.

The credibility of the degree would depend on the reputation of the exam (determined by “standards” combined with a bunch of marketing).

Once this system got working, we might see students take a series of courses (from a diverse collection of providers) and then take several degrees.

Maybe things won’t happen this way. But the traditional system is financially unsustainable. Something will have to change, and not just a little.

Decoupling Instruction from Credentialing

The vision above reflects a complete decoupling of instruction from credentialing. It might not be this extreme, but any level of decoupling creates new credentialing pressures / opportunities in industrial settings.

Instruction

Instruction consists of the courses, the coaching, the internships, and any other activities the students engage in to learn.

Credentialing

Credentials are independently-verifiable evidence that a person has some attribute, such as a skill, a type of knowledge, or a privilege.

There are several types of credentials:

  • A certification attests to some level of competency or privilege. For example,
    • A license to practice law, or to do plumbing, is a certification.
    • An organization might certify a person as competent to repair their equipment.
    • An organization might certify that, in their opinion, a person is competent to practice a profession.
  • A certificate attests that someone completed an activity
    • A certificate of completion of a course is a certificate
    • A university degree is a certificate
  • There are also formal recognitions (I’m sure there’s a better name for this…)
    • Awards from professional societies are recognitions
    • Granting someone an advanced type of membership (Senior Member or Fellow) in a professional society is a recognition
    • Election to some organizations (such as the American Law Institute or the Royal Academy of Science) is a recognition
    • I think I would class medals in this group
  • There are peer recognitions
    • Think of the nice things people say about you on LinkedIn or Entaggle
  • There are workproducts or results of work that are seen as honors
    • You have published X many publications
    • You worked on the development team for X

The primary credentials issued by universities are certificates (degrees). Sometimes, those are also certifications.

Dispersion of Credentialing

Anyone can issue a credential. However, the prestige, credibility, and power of credentials vary enormously.

  • If you need a specific credential to practice a profession, then no matter who endorses some other credential, or how nicely named that other credential is, it still won’t entitle you to practice that profession.
  • Advertising that you have a specific credential might make you seem more prestigious to some people and less prestigious to other people.

It is already the case that university degrees vary enormously in meaning and prestige. As schools further decouple instruction from degrees, I suspect that this variation will be taken even more seriously. Students of mine from Asia, and some consultants, tell me this is already the case in some Asian countries. Because of the enormous variation in quality among universities, and the large number of universities, a professional certificate or certification is often taken more seriously than a degree from a university that an employer does not know and respect.

Industrial Credentials

How does this relate to software testing? Well, if my analysis is correct (and it might well not be), then we’ll see an increase in the importance and value of credentialing by private organizations (companies, rather than universities).

I don’t believe that we’ll see a universally-accepted credential for software testers. The field is too diverse and the divisions in the field are too deep.

I hope we’ll see several credentialing systems that operate in parallel, reflecting different visions of what people should know, what they should believe, what they should be able to do, what agreements they are willing to make (and be bound by) in terms of professional ethics, and what methods of assessing these things are appropriate and in what depth.

Rather than seeing these as mutually-exclusive competing standards, I imagine that some people will choose to obtain several credentials.

A Few Comments On Our Current State

Software Testing has several types of credentials today. Here are notes on a few. I am intentionally skipping several that feel (to me) redundant with these or about which I have nothing useful to say. My goal is to trigger thoughts, not survey the field.

ISTQB

ISTQB is currently the leading provider of testing certifications in the world. ISTQB is the front end of a community that creates and sells courseware, courses, exams and credentials that align with their vision of the software testing field and the role of education within it. I am not personally fond of the Body of Knowledge that ISTQB bases its exams on. Nor am I fond of their approach to examinations (standardized tests that, to my eyes, emphasize memorization over comprehension and skill). I think they should call their credentials certificates rather than certifications. And my opinion of their marketing efforts is that they are probably not legally actionable, but I think they are misleading. (Apart from those minor flaws, I think ISTQB’s leadership includes many nice people.)

It seems to me that the right way to deal with ISTQB is to treat them as a participant in a marketplace. They sell what they sell. The best way to beat it is to sell something better. Some people are surprised to hear me say that because I have published plenty of criticisms of ISTQB. I think there is lots to criticize. But at some point, adding more criticism is just waste. Or worse, distraction. People are buying ISTQB credentials because they perceive a need. Their perception is often legitimate. If ISTQB is the best credential available to fill their need, they’ll buy it. So, to ISTQB’s critics, I offer this suggestion.

Industrial credentialing will probably get more important, not less important, over the next 20 years. Rather than wasting everyone’s time whining about the shortcomings of current credentials, do the work needed to create a viable alternative.

Before ending my comments on ISTQB, let me note some personal history.

Before ASTQB (American ISTQB) formed, a group of senior people in the community invited me into a series of meetings focused on creating a training-and-credentialing business in the United States. This was a private meeting, so I’m not going to say who sponsored it. The discussion revolved around a goal of providing one or more certification-like credentials for software testers that would be (this is my summary-list, not theirs, but I think it reflects their goals):

  • reasonably attainable (people could afford to get the credential, and reasonably smart people who worked hard could earn it),
  • credible (intellectually and professionally supported by senior people in the field who have earned good reputations),
  • scalable (it is feasible to build an infrastructure to provide the relevant training and assessment to many people), and
  • commercially viable (sufficient income to support instructors, maintainers of the courseware and associated documentation, assessors (such as graders of the students and evaluators of the courses), some level of marketing (because a credential that no one knows about isn’t worth much), and in the case of this group, money left over for profit. Note that many dimensions of “commercial viability” come into play even if there is absolutely no profit motive—the effort has to support itself, somehow).

I think these are reasonable requirements for a strong credential of this kind.

By this point, ISEB (the precursor to ISTQB) had achieved significant commercial success and gained wide acceptance. It was on people’s minds, but the committee gave me plenty of time to speak:

  • I talked about multiple-choice exams and why I didn’t like them.
  • I talked about the desirability of skill-based exams like Cisco’s, and the challenges of creating courses to support preparation for those types of exams.
  • I talked about some of the thinking that some of us had done on how to create a skill-based cert for testers, especially back when we were writing Lessons Learned.

But there was a problem in this. My pals and I had lots of scattered ideas about how to create the kind of certification system that we would like, but we had never figured out how to make it practical. The ideas that I thought were really good were unscalable or too expensive. And we knew it. If you ask today why there is no certification for context-driven testing, you might hear a lot of reasons, including principled-sounding attacks on the whole notion of certification. But back then, the only reason we didn’t have a context-driven certification was that we had no idea how to create one that we could believe in.

So, what I could not provide to the committee was a reasonably attainable, credible, scalable, commercially viable system—-or a plan to create one.

The committee, quite reasonably, chose to seek a practical path toward a credential that they could actually create. I left the committee. I was not party to their later discussions, but I was not surprised that ASTQB formed and some of these folks chose to work with it. I have never forgotten that they gave me every chance to propose an alternative and I did not have a practical alternative to propose.

(Not long after that, I started an alternative project, Open Certification, to see if we could implement some of my ideas. We did a lot of work in that project, but it failed. They really weren’t practical. We learned a lot, which in turn helped me create great courseware—BBST—and other ideas about certification that I might talk about more in the future. But the point that I am trying to emphasize here is that the people who founded ASTQB were open to better ideas, but they didn’t get them. I don’t see a reason to be outraged at them for that.)

The Old Boys’ Club

To some degree, your advancement in a profession is not based on what you know. It’s based on who you know and how much they like you.

We have several systems that record who likes you, including commercial ones (LinkedIn), noncommercial ones (Entaggle), and various types of marketing structures created by individuals or businesses.

There are advantages and disadvantages to systems based on whether the “right” people like you. Networking will never go away, and never should, but it seems to me that

Credentials based on what you know, what you can do, or what you have actually done are a lot more egalitarian than those based on who says they respect you.

I value personal references and referrals, but I think that reliance on these as our main credentialing system is a sure path to cronyism and an enemy of independent thinking.

My impression is that some people in the community have become big fans of reputation-systems as the field’s primary source of credentials. In at least some of the specific cases, I think the individuals would have liked the system a whole lot less when they were less influential.

Miagi-do

I’ve been delighted to see that the Miagi-do school has finally come public.

Michael Larsen states a key view succinctly:

I distrust any certification or course of study that doesn’t, in some way, actually have a tester demonstrate their skills, or have a chance to defend their reasoning or rationale behind those skills.

In terms of the four criteria that I mentioned above, I think this approach is probably reasonably attainable, and to me, it is definitely credible. Whether it is scalable and commercially viable remains to be seen.

I think this is a clear and important alternative to ISTQB-style credentialing. I hope it is successful.

Other Ideas on the Horizon

There are other ideas on the horizon. I’m aware of a few of them and there are undoubtedly many others.

It is easy to criticize any specific credentialing system. All of them, now known or coming soon, have flaws.

What I am suggesting here is:

  • Industrial credentialing is likely to get more important whether you like it or not.
  • If you don’t like the current options, complaining won’t do much good. If you want to improve things, create something better.


This post is partially based on work supported by NSF research grant CCLI-0717613 ―Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing. Any opinions, findings and conclusions or recommendations expressed in this post are those of the author and do not necessarily reflect the views of the National Science Foundation.

DOMA–Time for a change

March 28th, 2013

standformarriage

Some people have asked me why I posted this graphic. It is a token of solidarity with the Human Rights Campaign. This is one of several images that HRC supporters posted on their websites while the United States Supreme Court was hearing arguments about the Defense of Marriage Act.

Presentation on software metrics

March 26th, 2013

Most software metrics are terrible. They are as overhyped as they are poorly researched.

But I think it’s part of the story of humanity that we’ve always worked with imperfect tools and always will. We succeed by learning the strengths, weaknesses and risks of our tools, improving them when we can, and mitigating their risks.

So how do we deal with this in a way that manages risk while getting useful information to people we care about?

I don’t think there are easy answers to this, but I think several people in the field are grappling with this in a constructive way. I’ve seen several intriguing conference-talk descriptions in the last few months and hope to post comments that praise some of them later. But for now, here’s my latest set of notes on this: https://kaner.com/pdfs/PracticalApproachToSoftwareMetrics.pdf

 

Follow-Up Testing in the BBST Bug Advocacy Course

February 10th, 2013

Becky and I are working on a new version of the BBST courses (BBST-2014). In the interim, to support the universities teaching the course, the AST, and the many people who are studying the materials on their own, we’re publishing some of the ideas we have for clarifying the course. The first two in this series focused on oracles in the Foundations course and on interactive grading as a way to give students more detailed feedback. This post is about follow-up testing, which we cover in the current BBST-BA’s slides 44-68.

Some failures are significant and they occur under circumstances that seem fairly commonplace. These will either be fixed, or there will be a compelling reason to not fix them.

Other failures need work. When you see them, there’s ambiguity about the significance of the underlying bug:

  • Some failures look minor. Underlying the failure is a bug that might (or might not) be more serious than you’d imagine from the first failure.
  • Some failures look rare or isolated. Underlying that failure is a bug that might (or might not) cause failures more often, or under a wider range of conditions, than you’d imagine from the first failure.

Investigating whether a minor-looking failure reflects a more serious bug

To find out whether the underlying bugs are more serious than the first failures make them look, we can do follow-up testing. But what tests?

That’s a question that Jack Falk, Hung Quoc Nguyen and I wrestled with many times. You can see a list of suggestions in Testing Computer Software. Slides 44-56 sort those ideas (and add a few) into four categories:

  1. Vary your behavior
  2. Vary the options and settings of the program
  3. Vary data that you load into the program
  4. Vary the software and hardware environment

Some students get confused by the categories (which is not shocking, because the categories are a somewhat idiosyncratic attempt to organize a broad collection of test ideas), confused enough that they do a poor job on the exam of generating test ideas from the categories.

For example, they have trouble with this question:

Suppose that you find a reproducible failure that doesn’t look very serious.

  • Describe the four tactics presented in the lecture for testing whether the defect is more serious than it first appeared.
  • As a particular example, suppose that the display got a little corrupted (stray dots on the screen, an unexpected font change, that kind of stuff) in Impress when you drag the scroll bar up and down. Describe four follow-up tests that you would run, one for each of the tactics that you listed above.

I’m not going to solve this puzzle for you, but the solution should be straightforward if you understand the categories.

The slides work through a slightly different example:

A program unexpectedly but slightly scrolls the display when you add two numbers:

  • The task is entering numbers and adding
  • The failure is the scrolling.

Let’s consider the categories in terms of this example:

1. Vary your behavior

When you run a test, you intentionally do some things as part of the test. For example, you might:

  • enter some data into input variables
  • write some data into data files that the program will read
  • give the program commands

You might change any of these as part of your follow-up testing. (What happens if I do this instead of that?)

These follow-up tests might include changing the data or adding steps, substituting steps, or taking steps away.

For example, if adding one pair of numbers causes unexpected scrolling, suppose you try adding two numbers many times. Will the program scroll more, or scroll differently, as you repeat the test?

Suppose we modified the example so the program reads (and then adds) two numbers from a data file. Changing that data file would be another example of varying your behavior.

The slides give several additional examples.

2. Vary the options and settings of the program

Think about Microsoft Word. Here are some examples of its options:

  • Show (or don’t show) formatting marks on the screen
  • Check spelling (or don’t check it) as you type

In addition, you might change a template that controls the formatting of new documents. Some examples of variables you might change in the template are:

  • the default typeface
  • the spacing between tab stops
  • the color of the text

Which of these are “options” and which of these are “settings”? It doesn’t matter. The terminology will change from program to program. What doesn’t change is that these are persistent variables. Their value stays with the program from one document to another.

3. Vary data that you load into the program

This isn’t well worded. Students confuse what I’m talking about with the basic test data.

Imagine again testing Microsoft Word. Suppose that you are focused on the format of the document, so your test documents have lots of headings and tables and columns and headers and footers (etc.). If you change the document, save it, and load the revised one, that is data that you load into the program, but I think of that type of change as part of 1. Vary your behavior.

When Word starts up, it also loads some files that might not be intentional parts of your test. For example, it loads a dictionary, a help file, a template, and probably other stuff. Often, we don’t even think of these files (and how what they hold might affect memory or performance) when we design tests. Sometimes, changing one of these files can reveal interesting information.

4. Vary the software and hardware environment

For example,

  • Change the operating system’s settings, such as the language settings
  • Change the hardware (a different video card) or how the operating system works with the hardware (a different driver)
  • Change hardware settings (same video card, different display resolution)

This is a type of configuration testing, but the goal here is to try to demonstrate a more impressive failure, not to assess the range of circumstances under which the bug will appear.

Investigating whether an obscure-looking failure will arise under more circumstances

In this case, the failure shows up under what appear to be special circumstances. Does it only show up under special circumstances?

Slides 58-68 discuss this, but some additional points are made on other slides or in the taped lecture. Bringing them together…

  1. Uncorner your corner cases
  2. Look for configuration dependence
  3. Check whether the bug is new to this version
  4. Check whether failures like this one already appear in the bug database
  5. Check whether bugs of this kind appear in other programs

Here are a few notes:

1. Uncorner your corner cases

Risk-based tests often use extreme values (boundary conditions). These are good for exposing a failure, but once you find the failure, try less extreme values. Demonstration of failure with a less extreme test will yield a more credible report.

2. Look for configuration dependence

In this case, the question is whether the failure will appear on many configurations or just this one. Try it with more memory, with another version of operating system (or on another OS altogether), etc.

3. Check whether the bug is new to this version

Does this bug appear in earlier versions of the program? If so, did users (or others) complain about it?

If the bug is new, especially if it was a side-effect of a fix to another bug, some people will take it much more seriously than a bug that has been around a long time but rarely complained about.

4. Check whether bugs like this one already appear in the bug database

One of the obvious ways to investigate whether a failure appears under more general circumstances than the ones in a specific test is to check the bug tracking database to see whether the failure has in fact occurred under other circumstances. It often takes some creativity and patience to search out reports of related failures (because they probably aren’t reported in exactly the same way as you’re thinking of the current failure), but if your database has lots of not-yet-fixed bugs, such a search is often worthwhile.

5. Check whether bugs of this kind appear in other programs

The failure you’re seeing might be caused by the code specifically written by this programmer, or the bug might be in a library used by the programmer. If it’s in the library, the same bug will be in other programs. Similarly, some programmers model (or copy) their code from textbook descriptions or from code they find on the web. If that code is incorrect, the error will probably show up in other programs.

If the error does appear in other programs, you might find discussions of the error (how to find it, when it appears, how serious it is) in discussions of those programs.

This post is partially based on work supported by NSF research grant CCLI-0717613 ―Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing. Any opinions, findings and conclusions or recommendations expressed in this post are those of the author and do not necessarily reflect the views of the National Science Foundation.

An Overview of High Volume Automated Testing

January 14th, 2013

Summary: This overview describes the general concept of high volume automated testing (HiVAT) and twelve examples of HiVAT techniques. The common thread of these techniques is automated generation, execution and evaluation of arbitrarily many tests. The individual tests are often weak, but taken together, they can expose problems that individually-crafted tests will miss. The simplest techniques offer high coverage: they can reach a program’s weakness in handling specific special cases. Other techniques take the program through long sequences of tests and can expose problems that build gradually (e.g. memory leaks, stack corruption or memory corruption) or that involve unexpected timing of events. In my experience, these sequences have exposed serious problems in embedded software, especially in systems that involved interaction among a few processors. As we embed more software into more consumer systems, including systems that pose life-critical risks, I believe these tests will become increasingly important.

I am writing this note as an intentionally rough draft. It serves as an introduction for a course on HiVAT at Florida Institute of Technology. It provides a structure for work in the course. It is intentionally under-referenced. One of the students’ tasks in the course is to dig through a highly disjointed literature and map research papers, practitioner papers, and conference presentations to the techniques listed here. Students might also add HiVAT techniques, with associated papers, or add sections on directly-relevant automation-support technology or directly-relevant surveys of test automation strategies / results. I will replace this post with later drafts that are more academically complete as we make progress in the course.

Consider automated regression testing. We reuse a regression test several times–perhaps running it on every build. But are these really automated? The computer might execute the tests and do a simple evaluation of the results, but a human probably designed that test, a human probably wrote the test code that the computer executes, a human probably provided the test input data by coding it directly into the test or by specifying parameters in an input file, a human probably provided the expected results that the program uses to evaluate the test result, and if it appears that there might be a problem, it will be a human who inspects the results, does the troubleshooting and either writes a bug report (if the program is broken) or rewrites the test. All that work by humans is manual.

Notice, by the way, that this interplay of human work and computer work applies whether the tests are run at the system level using a tool like QuickTest Pro, at the unit level using a tool like jUnit, or at a hybrid level, using a tool like FIT.

So, automated regression tests are actually manual tests with an automated tint.

And every manual software test is actually automated. When you run a manual test, you might type in the inputs and look at the outputs, but everything that happens from the acceptance of those inputs to the display of the results of processing them is done by a computer under the control of a program–that’s automated.

The difference between “manual” and “automated” is a matter of degree, not a matter of principle. Some tests are more automated, others are less automated, but there is no “all” and there is no “none” for test automation.

High-Volume Tests Focused on Inputs

High-Volume Parametric Variation

We can make the automated regression tests more automated by transforming some of the human tasks to computer tasks. For example, imagine testing a method that takes two inputs (FIRST, SECOND) and returns their sum (SUM). A typical regression test would include specific values for FIRST, SECOND, and EXPECTED_SUM. But suppose we replace the specific test values with a test data generator that supplies random values for FIRST and SECOND and calculates the value of EXPECTED_SUM. We can generate billions of tests this way, each a little different, with almost no increase in human effort.

This is one of the simplest examples of high volume automated testing. A human supplied the algorithm, but the test tool applies that algorithm to create, run, and evaluate the results of arbitrarily many tests.
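The adder example can be sketched in a few lines of Python. This is only an illustration; the `add` method, the input ranges, and the harness names are my stand-ins, not code from any real test tool:

```python
import random

def add(first, second):
    # Hypothetical method under test.
    return first + second

def run_parametric_tests(trials, seed=0):
    """Generate random input pairs, compute EXPECTED_SUM, and check
    the method under test against it. Returns the failing inputs."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        first = rng.randint(-10**6, 10**6)
        second = rng.randint(-10**6, 10**6)
        # In a real harness the oracle calculation would be independent
        # of the implementation under test.
        expected_sum = first + second
        if add(first, second) != expected_sum:
            failures.append((first, second))
    return failures
```

One human-written algorithm, and the tool can generate and evaluate as many tests as you have computer time for.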

In this example, those tests are all pretty similar. Why bother running them all? The traditional answer is “don’t bother.” Domain testing is the most widely used software testing technique. The point of domain testing is to help the tester minimize test redundancy by selecting a small number of inputs to represent the larger set of possibilities. Usually, this works well. Occasionally, a program has a problem with a small number of specific inputs. For example, the program might be optimized in a way that processes a few specific values specially. Or it might be vulnerable to small calculation errors that are usually too small to notice, but occasionally have a visible effect (see Doug Hoffman’s report at http://www.testingeducation.org/BBST/foundations/Hoffman_Exhaust_Options.pdf for an example).

High-Volume Combination Testing

Rather than thinking of testing a single function that takes a few inputs, imagine something more complex. A program processes several inputs, perhaps with a series of functions, and reports an output. Again, the traditional approach is to minimize redundancy. When we do combination tests (test several variables together), the set of possible tests is usually huge. If the variables are independent, we can minimize the number of combination tests with combinatorial testing (e.g. all-pairs or all-triples). If the variables are related, we can use cause-effect graphing instead. But if we suspect that the program will fail only on a small number of specific combinations (and we don’t know which few those are), we have to test a larger set of combinations, generating input values and calculating the expected result for each combination.

There are different strategies for generating the input values:

  • Exhaustive sampling. This tests all the combinations but the set might be impossibly large.
  • Random sampling. Generate random values for the inputs, stopping when some large number have been tested.
  • Optimized sampling. Use an algorithm that optimizes the set of combinations in some way. As a very simple example, if you are going to test a million combinations, you could divide the space of combinations into a million same-size, non-overlapping subsets and sample one value of each. Or you could use a sequential algorithm that assigns values for the next combination by creating a test that is distant (according to some distance function) from all previous tests.
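The "divide the space into same-size, non-overlapping subsets and sample one value of each" idea can be sketched like this (the two-variable setup and the value ranges are invented for the illustration):

```python
import random

def stratified_combinations(n_samples, lo=0, hi=10**6, seed=0):
    """Optimized sampling sketch: split the range of FIRST into
    n_samples equal, non-overlapping subranges and draw one
    (FIRST, SECOND) pair from each, so the tests are spread
    across the combination space instead of clustering."""
    rng = random.Random(seed)
    width = (hi - lo) // n_samples
    combos = []
    for i in range(n_samples):
        first = rng.randint(lo + i * width, lo + (i + 1) * width - 1)
        second = rng.randint(lo, hi)
        combos.append((first, second))
    return combos
```

A distance-based sequential generator would replace the stratification loop with a search for the candidate farthest (under some distance function) from all previous tests.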

Input Fuzzing

So far, I’ve presented tests that generate both the inputs and the expected value of the test. The expected value serves as an oracle, a mechanism for deciding whether the program passed or failed the test. Test oracles are incomplete, but they are useful for automation. Even if we can’t detect all possible failures, we can detect any failures of a certain kind (such as a calculation error).

“Fuzzing” refers to a family of high-volume automated tests that vary the inputs but have no oracle. The test runs until the program crashes or fails in some other unmissable way.
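A bare-bones fuzzing loop might look like this in Python. The only "oracle" is an unhandled exception; the `target` callable is a stand-in for whatever program interface accepts the input:

```python
import random

def fuzz(target, trials, max_len=100, seed=0):
    """Feed random byte strings to `target` and record only the
    unmissable failures (here, unhandled exceptions)."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(max_len)))
        try:
            target(data)
        except Exception as exc:
            crashes.append((data, exc))
    return crashes
```

For example, `fuzz(lambda b: b.decode("ascii"), 50)` collects the inputs that make an ASCII decoder blow up, while a robust target produces an empty crash list.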

Hostile Data Stream Testing

Alan Jorgensen tested for many types of security errors by taking a good file in a standard format (e.g. PDF) and corrupting it by substituting one string in the file with another. In Jorgensen’s work, the new string was typically much longer or was syntactically different. Jorgensen would then open the now-corrupt file with the application under test. The program might reject the file as corrupt or accept it. If it accepted the file, Jorgensen could ask the program to use the file in some way (e.g. display it) and look for misbehavior. In some cases, Jorgensen could exploit a failure to recognize a corrupt file by embedding executable code that the program would then inappropriately execute. Jorgensen’s search for corruptions that would escape detection was an intensely automated activity. His analysis of exploitability was not.
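The core corruption step can be sketched as follows. This is a simplification in the spirit of Jorgensen's technique, not his actual tooling; the token and the 4096-byte replacement size are invented:

```python
import random

def corrupt(document, token, seed=0):
    """Hostile-data-stream sketch: replace one occurrence of `token`
    in a well-formed file with a much longer random string, producing
    a corrupt variant to feed to the application under test."""
    rng = random.Random(seed)
    hostile = bytes(rng.randrange(256) for _ in range(4096))
    return document.replace(token, hostile, 1)
```

The automated part is generating and opening thousands of such variants; deciding whether an accepted corrupt file is exploitable remains human work.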

There are several published variations of file fuzzing (variation of the contents or format of an input file) that are suited for different kinds of risks.

High-Volume Tests that Exploit the Availability of an Oracle

Having an oracle gives you a strong basis for high-volume automated testing. All that’s required is to generate inputs to the program under test and to drive the program to process the inputs. Given a result, you can use the oracle to check whether the program passed the test. No oracle is perfect: the program might fail the test even though its output matches the oracle’s. But even though you can’t learn everything that might possibly be interesting by relying on an oracle, you have a basis for running a boatload of tests that collectively check thoroughly for some types of failures and give you an opportunity to stumble onto some other types of failures (e.g. crashes from memory leaks) that you didn’t design the tests to look for. For any other ways that you can imagine the program might fail, you can design tailored manual tests as needed.

High-volume testing with an oracle won’t completely test the program (against all possible risks), but it will provide you with a level of coverage that you can’t achieve by hand.

Function Equivalence Testing

Function equivalence testing starts with an oracle–a reference function that should work the same way as the function under test. Given the reference function, you can feed inputs to the function under test, get the results, and check whether the reference function gives the same results. You can test with as many input values as you want, perhaps generating a large random set of inputs or (if you have enough available computer time) testing every possible input.

The final exam in my Programmer-Testing course illustrates function equivalence testing. The question that students have to answer is whether Open Office Calc does its calculations the same way as Microsoft Excel. We adopt a quality criterion: If Calc does it the same way as Excel, that’s good enough, even if Excel makes a few calculation errors.

To answer the question, the students:

  • test several functions individually
  • then test those functions together by testing formulas that combine several functions

To do this,

  • The students pick several individual functions in Calc
  • They test each by feeding random inputs to the Calc function and the same inputs to the corresponding Excel function.
  • Then they create random formulas that combine the functions, feed random data to the functions in the formula, and compare results.

If you test enough inputs, and the Calc results are always the same as Excel’s (allowing a little rounding error), it is reasonable to conclude that the calculation in Calc is equivalent to the calculation in Excel.
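The comparison loop the students build can be sketched like this. The functions and input generator here are hypothetical placeholders; in the exam, the candidate and reference are Calc and Excel driven through their automation interfaces:

```python
import math
import random

def check_equivalence(candidate, reference, gen_input,
                      trials=1000, tol=1e-9, seed=0):
    """Feed the same random inputs to the function under test and a
    trusted reference, allowing a little rounding error. Returns
    (True, None) if every trial matched, else (False, counterexample)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen_input(rng)
        if not math.isclose(candidate(x), reference(x), rel_tol=tol):
            return False, x
    return True, None
```

The `rel_tol` parameter is where "allowing a little rounding error" lives; tighten it and equivalent-but-differently-rounded implementations start to fail.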

Constraint Checks

We use a constraint oracle to check for impossible values or impossible relationships.

For example, an American ZIP code must be 5 or 9 digits. If you are working with a program that reads (or looks up or otherwise processes) ZIP codes, you can check every code that it processes. If it accepts (treats as a ZIP code) anything that has non-numeric characters or the wrong number of characters, then it has a bug. If you can find a way to drive the program so that it reads (or does whatever it does with) lots of ZIP codes, you have a basis for a set of high-volume automated tests.
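A constraint oracle for ZIP codes might be as small as this sketch (I allow 5 digits, 9 digits, or the hyphenated ZIP+4 form; adjust the pattern to your program's rules):

```python
import re

# 5 digits, 9 digits, or the hyphenated ZIP+4 form.
ZIP_PATTERN = re.compile(r"\d{5}(\d{4}|-\d{4})?")

def violates_zip_constraint(value):
    """Constraint oracle: flag anything the program treated as a ZIP
    code that has non-digit characters or the wrong number of digits."""
    return ZIP_PATTERN.fullmatch(value) is None
```

Wire this check into whatever drives the program through lots of ZIP codes and every processed value becomes a test.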

Inverse Operations

Imagine taking a list that is sorted from high to low, sorting it low to high, then sorting it back (high to low). If you can give this program enough lists (enough sizes, enough diversity of values), you can eventually conclude that it sorts correctly (or not). Any operation that you can invert, you can build a high volume test series against.
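The round-trip check can be sketched like this. Here Python's built-in `sorted` plays both the operation and its inverse; in real testing, the sort implementation under test would replace it:

```python
import random

def sort_round_trip_ok(descending_list):
    """Sort a high-to-low list low-to-high, sort it back high-to-low,
    and check that we recover the original list."""
    ascending = sorted(descending_list)
    back = sorted(ascending, reverse=True)
    return back == descending_list

def run_round_trips(trials=1000, seed=0):
    """Drive the round-trip check with many random lists of varied
    sizes and values."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randrange(50)
        lst = sorted((rng.randint(-1000, 1000) for _ in range(n)), reverse=True)
        if not sort_round_trip_ok(lst):
            return False
    return True
```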

State-Model Based Testing (SMBT)

If you have a state model (including a way to decide where the program should go if you give it an input and a way to determine whether the program actually got there), you can feed the program an arbitrarily long sequence of inputs and check the results.

I think the most common way to do SMBT is with a deterministic series of tests, typically the shortest series that will achieve a specified level of coverage. The typical coverage goal is every transition from each possible state to every state that it can reach next. You can turn this into a high-volume series by selecting states and inputs randomly and running the sequence for an arbitrarily long time. Ben Simo cautions that this has to be monitored because some programs will get locked into relatively tight loops, never reaching some states or some transitions. If you write your test execution system to check for this, though, you can force it out of the loop and into a not-yet-hit state.
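A minimal random-walk SMBT driver might look like the sketch below. The model is a toy dictionary; in practice, a real state model and an adapter that drives the actual implementation would replace both stand-ins. Tracking visited states is a hook for the kind of loop monitoring described above:

```python
import random

def random_walk(model, step, start, steps=1000, seed=0):
    """Drive the implementation with randomly chosen inputs and check
    each transition against the state model. `model` maps
    state -> {input: expected_next_state}; `step` applies an input to
    the implementation and reports the state it actually reached.
    Also returns the set of visited states so a monitor can notice
    tight loops that never reach some states."""
    rng = random.Random(seed)
    state, visited = start, {start}
    for _ in range(steps):
        choice = rng.choice(sorted(model[state]))
        expected = model[state][choice]
        actual = step(state, choice)
        if actual != expected:
            return False, (state, choice, expected, actual)
        state = actual
        visited.add(state)
    return True, visited
```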

Diagnostics-Based Testing

I worked at a company that designed telephone systems (Telenova). Our phones gave customers a menu-driven interface to 108 voice features and 110 data features. Imagine running a system test with 100 phones calling each other, putting each other on hold, transferring calls from one phone to another, then conferencing in outside lines, etc. We were never able to create a full state model for our system tests. They were just too complex.

Instead, the Telenova staff (programmers, testers, and hardware engineers) designed a simulator that could drive the phones from state to state with specified or random inputs. They wrote probes into the code to check whether the system was behaving in unexpected ways. A probe is like an assert command, but if the program triggers the probe, it logs an error instead of halting the program. A probe might check whether a variable had an unexpected value, whether a set of variables had unexpected values relative to each other, whether the program went through a series of states in an unexpected order, etc.
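In Python terms, a probe can be as simple as the sketch below (my illustration of the idea, not Telenova's code):

```python
probe_log = []  # anomalies collected during the run, reviewed afterward

def probe(condition, message, context=None):
    """Like an assert, but a triggered probe records the anomaly and
    lets the system keep running instead of halting it."""
    if not condition:
        probe_log.append((message, context))
```

The payoff is that an overnight test series keeps going past the first anomaly, so one run can log hundreds of suspicious events for later analysis.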

Because these probes checked the internal state of the system, and might check any aspect of the system, we called them diagnostics.

Implementing this type of testing required a lot of collaboration. The programmers wrote probes into their code. Testers did the first evaluations of the test logs and did extensive troubleshooting, looking for simple replication conditions for events that showed up in the logs. Testers and programmers worked together to fix bugs, change the probes, and specify the next test series.

As challenging as the implementation was, this testing revealed a remarkable number of interesting problems, including problems that would have been very hard to find in traditional ways but had the potential to cause serious failures in the field. This is what convinced me of the value of high-volume automated testing.

High-Volume Tests that Exploit the Availability of Existing Tests or Tools

Sometimes, the best reason to adopt a high-volume automated technique is that the adoption will be relatively easy. That is, large parts of the job are already done or expensive tools that can be used to do the job are already in place.

Long-Sequence Regression Testing (LSRT)

For example, Pat McGee and I wrote about a well-known company that repurposed its huge collection of regression tests of its office-automation products’ firmware. (In deference to the company’s desire not to be named, we called it Mentsville.)

When you do LSRT, you start by running the regression tests against the current build. From that set, pick only tests that the program passes. Run these in random order until the program fails (or you’ve run them for a long enough time). The original regression tests were designed to reveal functional problems, but we got past those by using only tests that we knew the program could pass when run one at a time. The bugs we found came from running the tests in a long series. For example, the program might run a series of 1000 tests, thirty of them the same as the first (Test 1), but it might not fail Test 1 until that 30th run. Why did it fail on the 30th run and not before?

  • Sometimes, the problem was a gradual build-up of bad data in the stack or memory.
  • Sometimes, the problem involved timing. For example, sometimes one processor would stay busy (from the last test) for an unexpectedly long time and wouldn’t be ready when it was expected for this test. Or sometimes the firmware would expect a location in memory to have been updated, but in this unusual sequence, the process or processor wouldn’t yet have completed the relevant calculation.
  • Sometimes the problem was insufficient memory (memory leak) and some of the leaks were subtle, requiring a specific sequence of events rather than a simple call to a single function.

These were similar to the kinds of problems we found at Telenova (running sequences of diagnostics-supported tests overnight).

Troubleshooting the failures was a challenge because it was hard to tell when the underlying failure actually occurred. Something could happen early in testing that wouldn’t cause an immediate failure but would cause gradual corruption of memory until hours later, the system crashed. That early event was what we needed to find in order to replicate the failure. To make troubleshooting easier, we started running diagnostics between tests, checking the state of memory, or how long a task had taken to execute, or whether a processor was still busy, or any of over 1000 other available system checks. We only ran a few diagnostics between tests. (Each diagnostic changed the state of the system. We felt that running too many diagnostics would change the state too much for us to get an accurate picture of the effect of running several regression tests in a row.) But those few tests could tell us a great deal. And if we needed more information, we could run the same 1000-test sequence again, but with different diagnostics between tests.
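Putting the pieces together, an LSRT driver with diagnostics between tests might be sketched like this. The test runner and diagnostic callables are stand-ins I invented for the illustration, not Mentsville's tooling:

```python
import random

def long_sequence_regression(passing_tests, run_test, diagnostics,
                             max_runs=10000, seed=0):
    """Re-run individually-passing regression tests in random order,
    with a few cheap diagnostics between tests, until something fails
    or the run budget is spent. Returns the outcome, the run index,
    the failing test (if any), and the diagnostic log."""
    rng = random.Random(seed)
    diag_log = []
    for i in range(max_runs):
        test = rng.choice(passing_tests)
        if not run_test(test):
            return "failed", i, test, diag_log
        diag_log.append((i, [check() for check in diagnostics]))
    return "completed", max_runs, None, diag_log
```

The diagnostic log is what makes the eventual failure troubleshootable: it shows the gradual drift (memory, timing, processor busyness) in the runs leading up to the crash.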

High-Volume Protocol Testing

A protocol specifies the rules for communication between two programs or two systems. For example, an ecommerce site might interact with VISA to bill a customer’s VISA card for a purchase. The protocol specifies what commands (messages) the site can send to VISA, what their structure should be, what types of data should appear in the messages and where, and what responses are possible (and what they mean). A protocol test sends a command to the remote system and evaluates the results (or sends a related series of commands and evaluates the series of responses back).

With popular systems, like VISA, lots of programs want to test whether they work with the system (whether they’ve implemented the protocol correctly and how the remote system actually responds). A tool to run these types of tests might be already available, or pieces of it might be available. If enough is available, it might be easy to extend it, to generate long random sequences of commands to the remote system and to process the system’s responses to each one.

Load-Enhanced Functional Testing

Common lore back in the 1970s was that a system behaved differently, functionally differently, when it was running under load. Tasks that it could perform correctly under “normal” load were done incorrectly when the system got busy.

Alberto Savoia described his experience of this in a presentation at the STAR conference in 2000. Segue (a test automation tool developer) built a service around it. As I understand their results, a system could appear 50%-busy (busy, but not saturated) but still be unable to correctly run some regression tests. The value of this type of testing is that it can expose functional weaknesses in the system. For example, suppose that a system running low on memory will rely more heavily on virtual memory and so the timing of its actions slows down. Suppose that a program running on the system spreads tasks across processors in a way that makes it vulnerable to race conditions. Suppose that Task 1 on Processor 1 will always finish before Task 2 on Processor 2 when the system is running under normal load, but under some heavy loads, Processor 1 will get tied up and won’t process work as quickly as Processor 2. If the program assumes that Task 1 will always get done before Task 2, that assumption will fail under load (and so might the program).
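The ordering assumption can be sketched in a few lines. This toy example (mine, not Segue’s; the explicit delays stand in for scheduling under load, which would produce the reordering only intermittently) forces the failure deterministically:

```python
import threading
import time

def run_tasks(task1_delay):
    """Run Task 1 and Task 2 on separate threads and record the
    order in which they finish."""
    order = []
    def task1():
        time.sleep(task1_delay)   # under heavy load, Task 1 gets tied up
        order.append("task1")
    def task2():
        time.sleep(0.02)          # Task 2's "normal" running time
        order.append("task2")
    t1 = threading.Thread(target=task1)
    t2 = threading.Thread(target=task2)
    t1.start(); t2.start()
    t1.join(); t2.join()
    # The program's unsafe assumption: Task 1 always finishes first.
    return order
```

Under light load (`task1_delay` near zero) the assumption holds; give Task 1 a longer delay and Task 2 finishes first, which is exactly the case the assuming program never handles.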

In the field, bugs like this produce hard-to-reproduce failures. Unless you find ways to play with the system’s timing, you might never replicate problems like this in the lab. They just become mystery-failures that a few customers call to complain about.

If you are already load-testing a program and if you already have a regression suite, then it might be very easy to turn this into a long-sequence regression test running with a moderate-load load test in parallel. The oracles include whatever comes with each regression test (to tell you whether the program passed that test or not) plus crashes, obvious long delays, or warnings from diagnostics (if you add those in).

A Few More Notes on History

I think that most of the high-volume work has been done in industry and not published in the traditional academic literature. For example, the best-known (best-publicized) family of high-volume techniques is “fuzzing”, attributed to Professor Barton Miller at the University of Wisconsin in 1988.

  • But I was seeing industrial demonstrations of this approach when I first moved to Silicon Valley in 1983, and I’ve been told of applications as early as 1966 (the “Evil” program at Hewlett-Packard).
  • Long-sequence regression testing was initially developed in 1984 or 1985. I played a minor role in developing Telenova’s diagnostics-based approach in 1987; my impression was that we were applying ideas already implemented years earlier at other telephone companies.
  • The testing staff at WordStar (a word-processing company) used a commercially available tool to feed long sequences of tests from one computer to another in 1984 or 1985. The first computer generated the tests and analyzed the test results. The second machine ran the program under test (WordStar). They were hooked up so that commands from the first machine looked like keyboard inputs to the second, and outputs that the second machine intended for its display actually went to the first machine for analysis. As an example of the types of bugs they found with this setup, they were able to replicate a seemingly-irreproducible crash that turned out to result from a memory leak. If you boldfaced a selection of text and then italicized it, there was a leak. If you italicized first, then applied bold, no leak. The test that exposed this involved a long random sequence of commands. I think this was a normal way to use that type of tool (long sequences with some level of randomization).
  • We also used random input generators in the telephone world. I think of a tool called The Hammer but there were earlier tools of this class. Hammer Technologies was formed in 1991 and I was hearing about these types of tools back in the mid-to-late-1980’s while I was at Telenova.

I’ve heard of related work at Texas Instruments, Microsoft, AT&T, Rolm, and other telephone companies, in the testing of FAA-regulated systems, at some auto makers, and at some other companies. I’m very confident that the work actually done is much more broadly spread than this small group of companies and includes many techniques other than the ones I’ve listed here. However, most of what I’ve seen has come via semi-private demonstrations and descriptions, or via demonstrations and descriptions at practitioner conferences. Most of the interesting applications that I have personally heard of have involved firmware or other software that controls hardware systems.

As far as I can tell, there is no common vocabulary for these techniques. Much of what has been published has been lost because many practitioner conference proceedings are not widely available and because old papers describing techniques under nonstandard names are just not being found in searches. I hope that we’ll fill in some of these gaps over the next several months. But even for the techniques that we can’t find in publications, it’s important to recognize how advanced the state of the practice was 25 years ago. I think we are looking for a way to make sometimes-closely-held industrial practices more widely known and more widely adopted, rather than inventing a new area.

This post is partially based on work supported by NSF research grant CCLI-0717613, “Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing.” Any opinions, findings and conclusions or recommendations expressed in this post are those of the author and do not necessarily reflect the views of the National Science Foundation.

 

Interactive Grading in University and Practitioner Classes: An Experience Report

January 14th, 2013

Summary: Graders typically work in private with no opportunity to ask even the simplest questions about a student’s submitted work. Interactive Grading is a technique that requires the student to participate in the grading of their work. The aim of this post is to share my experiences with interactive grading, with some tips for others who want to try it. I start with an overview and then provide three detailed examples, with suggestions for the conduct of the session. Let me stress one point: care must be exercised to keep the student comfortable and engaged, and not let the session degenerate into another lecture. In terms of results, most students told me that interactive grading was a good use of their time and that it helped them improve their performance. Several asked me to add more of it to my courses. A few viewed it as more indoctrination. In terms of impact on student grades, the improvement was marginal.

Interactive grading is not a new idea. I think most instructors have done this for some students at some times. Certainly, I have. So when Professor Keith Gallagher described it to me as an important part of his teaching, I didn’t initially understand its significance. I decided to try it myself after several of Keith’s students talked favorably about their classes with him. It wasn’t until I tried it that I realized what I had been missing.

That start came 15 months ago. Since then, I’ve used interactive grading in the online (professional-development) BBST classes, hybrid (online + face-to-face) university software testing classes, and a face-to-face university software metrics course. I’ve used it for exams, essays, and assignments (take-home complex tasks). Overall, it’s been a positive change.

These notes describe my personal experiences and reflections as an instructor. I emphasize this by writing in the very-obvious first person. Your experiences might be different.

What is Interactive Grading?

When I teach, I assign tasks and students submit their work to me for grading.

Usually I review student work in private and give them feedback after I have completed my review.

  • When I do interactive grading:
    • I meet with the student before I review the work.
    • I read the work for the first time while I meet with the student.
    • I ask the student questions, often open-ended questions that help me understand what the student was trying to say or achieve. Students often demonstrate that they understood the material better than their submitted work suggests. If they misunderstood part of the task, we can get to the bottom of the misunderstanding and they can try to demonstrate, during this meeting, their ability to do the task that was actually assigned.
    • I often coach the student, offering suggestions to improve the student’s strategy or demonstrating how to do parts of the task.
    • I typically show the student a grading rubric early in the meeting and assign the grade at the end of the meeting.
  • When I explicitly build interactive grading into my class:
    • It becomes part of the normal process, rather than an exception for a student who needs special attention. This changes the nature and tone of the discussions.
    • Every student knows well in advance what work will be interactively graded and that every student’s work will be handled the same way. This changes how they prepare for the meetings and how they interpret the meetings.
    • I can plan later parts of the course around the fact that the students have already had this experience. This changes my course design.

Costs and Benefits of Interactive Grading

Here is a summary of my conclusions. I’ll support this summary later in this report, with more detailed descriptions of what we did and what happened.

Costs

Interactive grading feels like it takes more time:

  • It takes time to prepare a grading structure that the students can understand (and therefore that I can use effectively when we have the meeting)
  • Scheduling can take a lot of time.
  • The meetings sometimes run long. It feels as though it takes longer to have the meeting than it would take to grade the work normally.

When I’ve checked my grading times for exams and assignments that I do in the traditional way, I think I actually spend the same amount of time (or more). I also do the same level of preparation. (Note: I do a lot of pre-grading preparation. The contrast might be greater for a less formal grader.)

As far as I can tell, the actual difference for me is not time, it is that interactive grading meetings are more stressful for me than grading in a quiet, comfortable home office. That makes it feel longer.

Benefits for the Students

During interactive grading, I can ask questions like,

  • What were you thinking?
  • What do you think this word in the question means? If I gave you a different explanation of what this word means, how would that affect your answer to the question?
  • Give me an example of what you are describing?
  • Can you give me a real-life example of what you are describing? For example, suppose we were working with OpenOffice. How would this come up in that project?
  • Can you explain this with a diagram? Show me on my whiteboard.
  • How would you answer this if I changed the question’s wording this way?
  • How would someone actually do that?
  • Why would anyone want to do that task that way? Isn’t there a simpler way to do the same thing?

I will raise the grade for a student who does a good job with these questions. I might say to the student,

“If I was only grading the written answer, you would get a ‘D’. But with your explanation, I am giving you a ‘B’. We need to talk about how you can present what you know better, so that you can get a ‘B’ again on the next exam, when I grade your written answers without an interactive supplement.”

If a student performs poorly on an exam (or an assignment, or an essay), the problem might be weak competence or weak performance.

  • A student who doesn’t know the material has a competence problem.
  • A student who knows the material but still gives a poor answer on an exam is showing a performance problem. For example, you won’t get a good answer from a knowledgeable student who writes poorly or in a disorganized way or who misunderstands the question.

These focusing questions:

  • give the student who knows the material a chance to give a much better explanation or a much better defense of their answer.
    • give the student who knows the material but performs poorly some indicators of the types of performance improvements they need to work on.
  • give the student who doesn’t know the material a clear form of feedback on why they are getting a poor grade.

Let’s consider competence problems. Students might lack the knowledge or the skills they are supposed to be learning for several reasons:

  • Some students simply don’t take the time or make the effort to do good work. Interactive grading probably won’t do much for them, beyond helping some of them understand the standards better.
  • Some students memorize words that they don’t really understand. The interactive grading discussion helps (some of) them understand a bit better the differences between memorized words and understanding something well enough to explain it in their own words and to explain how to do it or use it or why it’s important. It gives them a path to a different type of answer when they ask themselves while studying, “Do I know this well enough?”
  • Some students lack basic student-skills (how to study, how to look things up online, how to use the library, etc.) I can demonstrate these activities during the discussion, having the student do them with me. Students don’t become experts overnight with these, but as I’ll discuss below, I think this sometimes leads to noticeable improvements.

Some of the tasks that I assign to students can be done in a professional way. I choose tasks that I can demonstrate at a professional level of skill. In the give-and-take of interactive grading, I can say, “Let me show you how a professional would do that.” There is a risk of hijacking the discussion, turning it into yet-another-lecture. But judiciously used, this is a very personalized type of coaching of complex skills.

Now consider performance problems. The student understands the material but provides an exam answer or submits an assigned paper that doesn’t adequately reflect what they know, at their level of sophistication. A student with an “A” level of knowledge might look like a “C” student. A student with a “C” level of knowledge might look like a “D” or “F” student. These students are often puzzled by their poor grades. The interactive grading format makes it possible for me to show a student example after example after example of things they are doing (incomprehensible sentences, uninterpretable diagrams, confusing structure, confusing formatting, etc.) and how these make it hard on the reader. Here are two examples:

  • Some of my students speak English as a second language. Some speak/write it well; some are trying hard to communicate well but make grammar/spelling errors that don’t interfere with my ability to understand their writing. Some write sentences that I cannot understand. They expect me to guess their meaning and they expect me to bias my guessing in their favor. During interactive grading, I can read a sentence, realize that I can’t understand it, and then ask the student what it means. In some cases, the student had no idea (they were, as far as I can tell, bluffing, hoping I would give them points for nothing). In other cases, the student intended a meaning but they realized during the discussion that they were not conveying that meaning. For some students, this is a surprise. They didn’t realize that they were failing to communicate. Some have said to me that they had previously thought they were being downgraded for minor errors in their writing (spelling, grammar) rather than for writing something that the instructor could not understand. For some students, I think this changes their motivation to improve their writing.
  • Some students write disorganized answers, or answers that are organized in a fundamentally different way from the structure explicitly requested by the question. Some students do this strategically (with the goal of disguising their ignorance). Others are simply communicating poorly. In my experience, retraining these students is very hard, but this gives me another opportunity to highlight the problems in the work, to demonstrate how the problems affect how I analyze and evaluate their work, and how they could do it differently. (For more on this, see the discussion in the next section, Benefits for Me).

Benefits for Me

It’s easy to grade “A” work. The student “gets it” and so they get a high grade. When I recognize quickly that the student has met my objectives for a specific exam question, I stop analyzing it, award a very high grade, and move to the next question. Very fast.

In contrast, a typical “C” or “D” answer takes a lot longer to grade. The answer is typically disorganized, confused, hard to understand, rewords the question in an effort to present the question as its answer, seems to make inappropriate assumptions about what I should find obvious, has some mistakes, and/or has contradictions or inconsistencies that make the answer incoherent even though each individual component could be argued to be not-necessarily-wrong.

When I say, “disorganized”, I mean (for example) that if the question asks for parts (1), (2), (3), (4) and (5), the student will give three sections instead that address (in the first one) (1), (3) and (5), (in the second one) (1) and (2) and (in the third one) (3) and (5) with a couple of words from the question about part (4) but no added information.

I waste a lot of time trying to understand these answers. When I grade privately, I read a bad answer over several times, muttering as I try to parse the sentences and map the content to the question that was asked. It’s uncertain work: I constantly question whether I am actually understanding what the student meant, while trying to figure out how much benefit of how much doubt I should give the student.

When I do interactive grading with a student, I don’t have to guess. I can say, “The question asked for Part 1. I don’t see an answer directly to Part 1. Can you explain how your answer maps to this part of the question?” I insist that the student map their actual words on the exam to the question that was asked. If they can’t sort it out for me, they can flunk. Next time, based on this experience, they can write in a way that will make it easier to map the answer to the question (make it easier for me to grade; get them a better grade).

This discussion is difficult. It can be unpleasant for the student and for me, and I have to be diplomatic and somewhat encouraging or it will become a mess. But I don’t have to struggle alone with not understanding what was written. I can put the burden back on the student.

Some other benefits:

  • Students who write essays by copying content without understanding it demonstrate their cluelessness when we grade the essay interactively.
  • Students who cheat on a take-home exam (in my experience so far) avoid the interactive grading session, where they would probably demonstrate that they don’t understand their own answers.
  • Students who want to haggle with me about their grade have an opportunity to do so without driving me crazy. I now have a way to say, “Show me what makes this a ‘B’,” instead of arguing with them about individual points or about their situational need for a higher grade.

Maybe the most important benefits:

  • (Most) students tell me they like it and many of them ask to do it again (to transform a subsequent task into an interactively graded one or to add interactive grading in another course)
  • Their performance seems to improve, sometimes significantly, which makes the next work easier to grade
  • This is highly personalized instruction. The student is getting one-on-one attention from the professor. Many students feel as though they don’t get enough personal attention and this addresses that feeling. It also makes some students a little more comfortable with dropping by my office for advice at other times.

A More Detailed Report

Someone who is just trying to learn what interactive grading is should stop here. What follows is “nuts and bolts”.

I’ve done interactive grading for three types of work:

  • midterm exams
  • practical assignments (homework that requires the student to apply something they have learned to a real-life task)
  • research essays

The overall process for interactive grading is the same for all three. But I’ve also noticed some differences. Marking exams isn’t the same as marking essays; neither is grading them interactively.

If you’re going to try interactive grading for yourself, the differences among descriptions of how it worked for these might help you make a faster and surer start.

Structural/Logistical Matters

I tell students what “interactive grading” is at the start of the course and I tell them which pieces of their work will be graded interactively. In most courses (in all of my university courses for credit), I tell them that this is a mandatory activity.

Some students choose not to participate in interactive grading sessions. Until now, I have tolerated this and graded their work in the traditional way. However, no one has ever given me a good reason for avoiding these sessions (and I have run into several bad ones). The larger problem for me is that later in the course, when I want to do something that assumes that every student has had the interactive grading experience, it is inappropriate for students who skipped it. In the future, in academic courses, I will reinforce the “mandatory” nature of the activity by assigning a grade of zero on the work to a student who (after being warned of this) chooses not to schedule a grading meeting.

Scheduling the meetings is a challenge. To simplify it, I suggest a Doodle poll (www.doodle.com). Post the times you can be available and let each student sign up for a time (from your list) that s/he is available.

I schedule the meetings for 1.5 hours. They often end at 1 hour. Some drag on. If the meeting’s going to run beyond 2 hours, I usually force the meeting to a conclusion. If I think there will be enough value for the student, I offer to schedule a follow-up meeting to go over the rest of the work. If not, then either I announce the grade at the end of the session or I tell the student that I will grade the rest of the work in the traditional way and get back to them with the total.

During the meeting, the student sits across a desk from me. I use a computer with 3 monitors. Two of the monitors show the same information. I look at one and turn the other toward the student. Thus, the student and I can see the same things without having to crowd together to look over each other’s shoulder at one screen. The third screen faces me. I use it to look at any information that I don’t want to share with the student. I use 27″ monitors (small enough to fit on my desk, big enough to be readable for most students) with 1920×1080 resolution. This easily fits two readable word-processing windows side-by-side, such as the student’s submitted work in one window and the grading guide in the next window.

Example 1: Midterm Exams

In some of my courses, I give the students a list of questions well before the exam and draw the exam questions from the list. In my Software Testing 1 course, for example (the “Black Box Testing Course”), I give students a 100-question subset of this list: http://www.testingeducation.org/BBST/takingexams/ExamEssayQuestions2010.pdf

I outlined the costs and benefits of this approach at WTST 2003, in: https://kaner.com/pdfs/AssessmentTestingCourse.pdf. For our purposes, the most important benefit is that students have time before the exam to prepare an answer to each question. They can’t consult their prepared answers during the actual exam, but this lets them come to the exam well-prepared, with a clear idea of what the question means, how to organize the answer, and what points they want to make in it.

I typically give 2 or 3 midterms in a course and I am typically willing to drop the worst one. Thus the student can do poorly on the first midterm, use that experience to learn how to improve their study strategy and writing, and then do better on the next one(s). I do the interactive grading with the first midterm.

Students sometimes submit exams on paper (traditional, handwritten supervised exam), sometimes in a word processor file (supervised exam where students type at a university-owned computer in a university class/exam-room), and sometimes in a word processor file (unsupervised take-home exam). For our purposes, assume that the student submitted an electronic exam.

Before I start grading any student’s work, I prepare a grading guide that identifies the types of information that I expect to see in the answer and the points that are possible for that particular type of info. If you’ve never seen that type of grading structure, look at my slides and videos (“How we grade exams”) at http://www.testingeducation.org/BBST/takingexams/.

Many of my questions are (intentionally) subject to some interpretation. Different students can answer them differently, even reaching contradictory conclusions or covering different technical information, but earn full points. The grading guide will allow for this, offering points for several different clusters of information instead of showing only One True Answer.

During the meeting, I rely frequently on the following documents, which I drag on and off of the shared display:

  • The student’s exam
  • The grading guide for that exam
  • The set of all of the course slides
  • The transcripts of the (videotaped) lectures (some or all of my lectures are available to the students on video, rather than given live)
  • The assigned papers, if any, that are directly relevant to exam questions.

We work through the exam one question at a time. The initial display is a copy of the exam question and a copy of the student’s answer. I skim it, often running my mouse pointer over the document to show where I am reading. If I stop to focus, I might select a block of text with the mouse, to show what I am working on now.

  • If the answer is well done, I’ll just say “good” and announce the grade (“That’s a 10 out of 10”). Then I skip to the next answer.
  • If the answer is confusing or sometimes if it is incomplete, I might start the discussion by saying to the student, “Tell me about this.” Without having seen my grading guide, the student tells me about the answer. I ask follow-up questions. For example, sometimes the student makes relevant points that aren’t in the answer itself. I tell the student it’s a good point and ask where that idea is in the written answer. I listed many of my other questions near the start of this report.
  • At some point, often early in the discussion, I display my grading guide beside the student’s answer, explain what points I was looking for, and either identify things that are missing or wrong in the student’s answer or ask the student to map their answer to the guide.
    • Typically, this makes it clear that the point is not in the answer, and what the grading cost is for that.
    • Some students try to haggle with the grading.
      • Some simply beg for more points. I ask them to justify the higher grade by showing how their answer provides the information that the grading guide says is required or creditable for this question.
      • Some tell me that I should infer from what they did write that they must have known the other points that they didn’t write about and so I should give them credit. I explain that I can only grade what they say, not what they don’t say but I think they know anyway.
      • Some agree that their answer as written is incomplete or unclear but they show in this meeting’s discussion that they understand the material much better than their answer suggests. I often give them additional credit, but we’ll talk about why it is that they missed writing down that part of the answer, and how they would structure an answer to a question like this in the future so that their performance on the next exam is better.
      • Some students argue with the analysis I present in the guide. Usually this doesn’t work, but sometimes I give strong points for an analysis that is different from mine but justifiable. I might add their analysis to the grading guide as another path to high points for that answer.
  • The student might claim that I am expecting the student to know a specific detail (e.g. definition or fact) that wasn’t taught in the course. This is when I search the slides and lecture transcripts and readings, highlighting the various places that the required information appeared. I try to lead from here to a different discussion–How did you miss this? What is the hole in your study strategy?
  • Sometimes at the end of the discussion, I ask the student to go to the whiteboard and present a good answer to the question. I do this when I think it will help the student tie together the ideas we’ve discussed, and from doing that, see how to structure answers better in the future. A good presentation (many of them are good enough) might take the assigned grade for the question from a low grade to a higher one. I might say to the student, “That’s a good analysis. Your answer on paper was worth 3/10. I’m going to record a 7/10 to reflect how much better a job you can actually do, but next time we won’t have interactive grading so you’ll have to show me this quality in what you write, not in the meeting. If you give an answer this good on the next exam, you’ll get a 9 or 10.”

In a 1.5 hour meeting, we can only spend a few minutes on each question. A long discussion for a single question runs 20 minutes. One of my tasks is to move the discussion along.

Students often show the same weakness in question after question. Rather than working through the same thing in detail each time, I’ll simply note it. As a common example, I might say to a student

Here’s another case where you answered only 2 of the 3 parts of the question. I think you need to make a habit of writing an outline of your answer, checking that the outline covers every part of the question, and then filling in the outline.

Other students were simply unprepared for the exam, and most of their answers are light on knowledge. Once it’s clear that lack of preparation was the problem (often it becomes clear because the student tells me that was the problem), I speed up the meeting, looking for things to ask or say that might add value (for example, complimenting a good structure, even though it is short on details). There is no value in dragging out the meeting. The student knows the work was bad. The grade will be bad. Any time that doesn’t add value will feel like scolding or punishment, rather than instruction.

The goal of the meeting is constructive. I am trying to teach the student how to write the next exam better. We might talk about how the student studied, how the student used peer review of draft answers, how the student outlined or wrote the answer, how the student resolved ambiguities in the question or in the course material — and how the student might do this differently next time.

Especially if the student achieved a weak grade, I remind the student that this is the first of three midterms and that the course grade is based on the best two. So far, the student has lost nothing. If they can do the next two well, they can get a stellar grade. For many students, this is a pleasant and reassuring way to conclude the meeting.

The statistical results of this are unimpressive. For example, in a recent course, 11 students completed the course.

  • On midterm 1 (interactively graded), their average grade was 76.7.
  • On midterm 2, the average grade was 81.5.
  • On midterm 3, the average grade was 76.3.
  • On the final exam, the average grade was 77.8.

Remember that I gave students added credit for their oral presentation during interactive grading, which probably added about 10 points to the average grade for midterm 1. Also, I think I grade a little more strictly toward the end of the course and so a B (8/10) answer for midterm 2 might be a B- (7.5) for midterm 3 or the final. Therefore, even though these numbers are flat, my subjective impression of the underlying performance was that it was improving, but this was not a powerful trend.

At the end of the meetings, I asked students whether they felt this was a good use of their time and whether they felt it had helped them. They all told me that it did. In another class (metrics), some students who had gone through interactive grading in the testing course asked for interactive grading of the first metrics midterm. This seemed to be another indicator that they thought the meetings had been helpful and sufficiently pleasant experiences.

This is not a silver bullet, but the students and I feel that it was helpful.

Example 2: Practical Assignments

In my software testing class, I assign tasks that people would do in actual practice. Here are two examples that we have taken to interactive grading:

  1. The student joins the OpenOffice (OOo) project (https://blogs.apache.org/OOo/entry/you_can_help_us_improve) and reviews unconfirmed bug reports. An unconfirmed bug has been submitted but not yet replicated. The student tries to replicate it and adds notes to the report, perhaps providing a simpler set of steps to get to the failure or information that the failure shows up only on a specific configuration or with specific data. The student also writes a separate report to our class, evaluating the communication quality and the technical quality of the original report. An example of this assignment is here: http://www.testingeducation.org/BBST/bugadvocacy/AssignmentBugEvaluationv11.3.pdf
  2. The student picks a single variable in OpenOffice Writer (such as the number of rows in a table) and does a domain analysis of it. In the process, the student imagines several (about 20) ways the program could fail as a consequence of an attempt to assign a value to the variable or an attempt to use the variable once that value has been assigned. For each risk (way the program could fail), the student divides the values of the variable into equivalence classes (all the values within the same class should cause the test with that variable and that value to behave the same way) and then decides which one test should be used from each class (typically a boundary value). An example of this assignment is here: http://www.testingeducation.org/BBST/testdesign/AssignmentRiskDomainTestingFall2011.pdf

The Bug Assignment

When I meet with the student about the bug assignment, we review the bug report (and the student’s additions to it and evaluation of it). I start by asking the student to tell me about the bug report. In the discussion, I have several follow-up questions, such as:

  • how they tried to replicate the report
  • why they stopped testing when they did
  • why they tested on the configurations they did
  • what happened when they looked for similar (often equivalent) bugs in the OOo bug database, and whether they learned anything from those other reports.

In many cases, especially if the student’s description is a little confusing, I will bring up OpenOffice and try to replicate the bug myself. Everything I do is on the shared screen. They see what I do while I give a running commentary.

  • Sometimes I ask them to walk me through the bug. They tell me what to do. I type what they say. Eventually, they realize that the instructions they wrote into the bug report aren’t as clear or as accurate as they thought.
  • Sometimes I try variations on the steps they tried, especially if they failed to replicate the bug.

My commentary might include anecdotes about things that have happened (that I saw or did) at real companies, or comments on, and demonstrations of, two ways to do essentially the same thing, with the second being simpler or more effective.

When I ask students about similar bugs in the database, most don’t know how to do a good search, so I show them. Then we look at the reports we found and decide whether any are actually relevant.

I also ask the student how the program should work and why they think so. What did they do to learn more about how the program should work? I might search for specifications or try other programs and see what they do.

I will also comment on the clarity and tone of the comments the student added to the OOo bug report, asking the student why they said something a certain way or why they included some details and left out others.

Overall, I am providing different types of feedback:

  • Did the student actually do the work that was assigned? Often they don’t do the whole thing. Sometimes they miss critical tasks. Sometimes they misunderstand the task.
  • Did the student do the work well? Bug reporting (and therefore this assignment) involves a mixture of persuasive technical writing and technical troubleshooting.
    • How well did they communicate? How much did they improve the communication of the original report? How well did they evaluate the communication quality of the original report?
    • How well did they troubleshoot? What types of information did they look for? What other types could they have looked for that would probably have been helpful? What parameters did they manipulate and how wise were their choices of values for those parameters?

Thinking again about the distinction between competence and performance,

  • Some students show performance problems (for example, they do the task poorly because they habitually follow instructions sloppily). For these students, the main feedback is about their performance problems.
  • Some students show competence problems: they are good at following instructions and they can write English sentences, but they need to learn how to do a better job of bug reporting. For these students, the main feedback is about what they are already doing well and what professional-quality work of this type looks like.

This is a 4-phase assignment. I try to schedule the interactive grading sessions soon after Phase 1, because Phase 3 is a more sophisticated repetition of Phase 1 (similar task on a different bug report). The ideal case is Phase 1 — Feedback — Phase 3. Students who were able to schedule the sessions this way have told me that the feedback helped them do a much better job on Phase 3.

The Domain Testing Analysis

Every test technique is like a lens that you look through to see the program. It brings some aspects of the program into clear focus, and you test those in a certain way. It pretty much makes the other aspects of the program invisible.

In domain testing, you see a world of variables. Each variable stands out as a distinct individual. Each variable has sets of possible values:

  • The set of values that users might try to assign to the variable (the values you might enter into a dialog box, for example). Some of these values are invalid — the variable is not supposed to take on these values and the program should reject them.
  • The set of values that the variable might actually take on — other parts of the program will use this variable and so it is interesting to see whether this variable can take on any (“valid”) values that these parts of the program can’t actually handle.
  • The set of values that might be output, when this variable is displayed, printed, saved to disk, etc.

These sets overlap, but it is useful to recognize that they often don’t map perfectly onto each other. A test of a specific “invalid” value might sometimes be useful, sometimes uninteresting, and sometimes impossible.

For most variables, each of these sets is large, and so you would not want to test every value in the set. Instead, domain testing has you group values as equivalent (they will lead to the same test result) and then sample only one or two values from every set of equivalents. This is also called equivalence-class analysis and boundary testing.
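The grouping-and-sampling step can be sketched in code. This is an illustrative sketch only: the variable (the number of rows in a table, with hypothetical limits of 1 to 1000) and the function names are my own inventions, not part of any BBST assignment.

```python
# A minimal sketch of equivalence-class / boundary-value analysis for one
# integer variable. The limits (1..1000 rows) are hypothetical.

def equivalence_classes(lo, hi):
    """Partition candidate inputs for a variable that should accept lo..hi."""
    return {
        "too_small": range(-10**6, lo),     # invalid: below the minimum
        "valid": range(lo, hi + 1),         # valid values
        "too_large": range(hi + 1, 10**6),  # invalid: above the maximum
    }

def boundary_tests(lo, hi):
    """Pick one or two representatives per class, favoring boundary values."""
    return {
        "too_small": [lo - 1],
        "valid": [lo, hi],
        "too_large": [hi + 1],
    }

tests = boundary_tests(1, 1000)
print(tests)  # {'too_small': [0], 'valid': [1, 1000], 'too_large': [1001]}
```

In practice, the interesting part of the assignment is identifying the risks and the classes, not this mechanical sampling; the code only makes the structure of the analysis concrete.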

One of the most challenging aspects of this task is adopting the narrow focus of the technique. People aren’t used to doing this. Even working professionals — even very skilled and experienced working professionals — may not be used to doing this and might find it very hard to do. (Some did find it very hard, in the BBST:Test Design course that I taught through the Association for Software Testing and in some private corporate classes.)

Imagine an assignment that asks you to generate 15 good tests that are all derived from the same technique. Suppose you imagine a very powerful test (or a test that is interesting for some other reason) that tests that variable of that program (the variable you are focusing on). Is it a good test? Yes. Is it a domain test? Maybe not. If not, then the assignment is telling you to ignore some perfectly good tests that involve the designated variable, maybe in order to generate other tests that look more boring or less powerful. Some people find this confusing. But this is why there are so many test techniques (BBST: Test Design catalogs over 100). Each technique is better for some things and worse for others. If you are trying to learn a specific technique (you can practice a different one tomorrow), then tests that are not generated by that technique are irrelevant to your learning, no matter how good they are in the general scheme of things.

We run into a conflict of intuitions in the practitioner community here. The view that I hold, that you see in BBST, is that the path to high-skill testing is through learning many different techniques at a high level of skill. To generate a diverse collection of good tests, use a diverse set of techniques. To generate a few tests that are optimized for a specific goal, use a technique that is appropriate for that goal. Different goals, different techniques. But to do this, you have to apply a mental discipline while learning. You have to narrow your focus and ask, “If I were a diehard domain tester who had no use for any other technique, how would I analyze this situation and generate my next tests?” Some people have told me this feels narrow-minded, that it is more important to train testers to create good tests and to get in touch with their inner creativity than to cramp their style with a narrow vision of testing.

As I see it, this type of work doesn’t make you narrow-minded. It doesn’t stop you from using other techniques when you actually do testing at work. This is not your testing at work. It is your practice, to get good enough to be really good at work. Think of practicing baseball. When you are in batting practice, trying to improve your hitting, you don’t do it by playing catch (throwing and catching the ball), not even if it is a very challenging round of catch. That might be good practice, but it is not good batting practice.

The domain testing technique helps you select a few optimal test values from a much larger set of possibilities. That’s what it’s good for. We use it to pick the most useful specific values to test for a given variable. Here’s a heuristic: If you design a test that would probably yield the same result (pass/fail) no matter what value is in the variable, you are probably focused on testing a feature rather than a variable, and you are almost certainly not designing using domain testing.

Keeping in mind the distinction between competence and performance,

  • Some students show performance problems (for example, they do the task poorly because they habitually follow instructions sloppily). For these students, the main feedback is about their performance problems.
  • The students who show competence problems are generally having trouble with the idea of looking at the world through the lens of a single technique. I don’t have a formula for dealing with these students. It takes individualized questioning and example-creating, and that doesn’t always work. Here are some of the types of questions:
    • Does this test depend on the value of the variable? Does the specific value matter? If not, then why do we care which values of the variable we test with? Why do a domain analysis for this?
    • Does this test mainly depend on the value of this variable or the value of some other variable?
    • What parts of the program would care if this variable had this value instead of that value?
    • What makes this specific value of the variable better for testing than the others?

From my perspective as the teacher, these are the hardest interactive grading discussions.

  • The instructions are very detailed, but they are a magnet for performance problems that dominate the discussions with the weaker students.
  • The technique is not terribly hard, if you are willing to let yourself apply it in a straightforward way. The problem is that people are not used to explicitly using a cognitive lens, which is essentially what a test technique is.

Student feedback on this has been polite but mixed. Many students are enthusiastic about interactive grading and (say they) feel that they finally understood what I was talking about after the discussion that applied it to their assignment. Other students came away feeling that I had an agenda, or that I was inflexible.

Example 3: Research Essays

In my Software Metrics course, I require students to write two essays. In each essay, the student is required to select one (1) software metric and to review it. I give students an outline for the essay that has 28 sections and subsections, requiring them to analyze the validity and utility of the metric from several angles. They are to look in the research literature to find information for each section, and if they cannot find it, to report how they searched for that information (what search terms in which electronic databases) and summarize the results. Then they are to extrapolate from the other information they have learned to speculate what the right information probably is.

These are 4th year undergraduates or graduate students. A large percentage of these students lack basic library skills. They do not know how to do focused searches in electronic databases for scholarly information and they do not know how to assess its credibility or deal with conflicting results and conflicting conclusions. Many of them lack basic skills in citing references. Few are skilled at structuring a paper longer than 2 or 3 pages. I am describing good students at a well-respected American university that is plenty hard to get into. We have forced them to take compulsory courses in writing, but those courses only went so far and the students only paid so much attention. My understanding, having talked at length with faculty at other schools, is that this is typical of American computer science students.

My primary goal is to help students learn how to cut through the crap written about most metrics (wild claims pro and con, combined with an almost-shocking lack of basic information) so that they can do an analysis as needed on the job.

Metrics are important. They are necessary for management of projects and groups. Working with metrics will be demanded of most people who want to rise beyond junior-level manager or mid-level programmer, and of plenty of people whose careers will dead-end below that level. But badly-used metrics can do more harm than good. Given that most (or all) software metrics are poorly researched, have serious problems of validity (or at least have little or no supporting evidence of validity), and carry serious risk of causing side-effects (measurement dysfunction) in the organization that uses them, there is no simple answer to what basket of metrics should be used for a given context. On the other hand, we can look at metrics as imperfect tools. People can be pretty good at finding information if they understand what they are looking for and they understand the strengths and weaknesses of their tools. And they can be pretty good at limiting the risks of risky things. So rather than encouraging my students to adopt a simplistic, self-destructive (or worse, pseudo-moralistic) attitude of rejection of metrics, I push them to learn their tools, their limits, their risks, and some ways to mitigate risk.

My secondary goal is to deal with the serious performance problems that these students have with writing essays.

I assign two essays so that they can do one, get detailed feedback via interactive grading, and then do another. I have done this in one (1) course.

Before the first essay was due, we had a special class with a reference librarian who gave a presentation on finding software-metrics research literature in the online databases of Florida Tech’s library and we had several class discussions on the essay requirements.

The results were statistically unimpressive: the average grade went from 74.0 to 74.4. Underneath the numbers, though, is the reality that I enforced a much higher standard on the second essay. Most students did better work on Essay 2, often substantially.

The process for interactive grading was a little different. I kept the papers for a week before starting interactive grading, so that I could check them for plagiarism. I use a variety of techniques for this (if you are curious, see the video course on plagiarism-detection at http://www.testingeducation.org/BBST/engethics/). I do not do interactive grading sessions with plagiarists. The meeting with them is a disciplinary meeting and the grade is zero.

Some students skirted the plagiarism line, some intentionally and others not. In the interactive grading session, I made a point of raising this issue, giving these students feedback on what I saw, how it could be interpreted and how/why to avoid it in the future.

For me, doing the plagiarism check first is a necessary prerequisite to grading an essay. That way, I have put the question behind me about whether this is actually the student’s work. From here, I can focus on what the student did well and how to improve it.

A plagiarism check is not content-focused. I don’t actually read the essay (or at least, I don’t note or evaluate its content, I ignore its structure, I don’t care about its style apart from looking for markers of copying, and I don’t pay attention to who wrote it). I just hunt for a specific class of possible problems. Thus, when I meet with the student for interactive grading, I am still reading the paper for the first time.

During the interactive grading session, I ask the student to tell me about the metric.

One of the requirements that I impose on the students (most don’t do it) is that they apply the metric to real code that they know, so that they can see how it works. As computer science students, they have code, so if they understand the metric, they can do this. I ask them what they did and what they found. In the rare case that the student has actually done this, I ask about whether they did any experimenting, changing the code a little to see its effect on the metric. With some students, this leads to a good discussion about investigating your tools. With the students who didn’t do it, I remind them that I will be unforgiving about this in Essay 2. (Astonishingly, most students still didn’t do this in Essay 2. This cost each of them a one-letter-grade (10-point) penalty.)

From there, I typically read the paper section by section, highlighting the section that I am reading so that the student can follow my progress by looking at the display. Within about 1/2 page of reading, I make comments (like, “this is really well-researched”) or ask questions (like, “this source must have been hard to find, what led you to it?”).

The students learned some library skills in the meeting with the librarian. The undergrads have also taken a 1-credit library skills course. But most of them have little experience. So at various points, it is clear that the student has not been able to find any (or much) relevant information. This is sometimes an opportunity for me to tell the student how I would search for this type of information, and then demonstrate that search. Sometimes it works; sometimes I find nothing useful either.

In many cases, the student finds only the cheerleading of a few people who created a metric or who enthuse about it in their consulting practices. In those reports, all is sunny and light. People might have some problems trying to use the metric, but overall (these reports say), someone who is skilled in its use can avoid those problems and rely on it.

In the discussion, I ask skeptical questions. I demonstrate searches for more critical commentary on the metric. I point out that many of the claims by the cheerleaders lack data. They often present summaries of not-well-described experiments or case studies. They often present tables with numbers that came from not-exactly-clear-where. They often present reassuring experience reports.

Experience reports (like the one you are reading right now) are useful. But they don’t have a lot of evidence-value. A well-written experience report should encourage you to try something yourself, giving you tips on how to do that, and encourage you to compare your experiences with the reporter’s. For experience reports to provide enough evidence of some claim to let you draw some conclusions about it, I suggest that you should require several reports that came from people not associated with each other and that made essentially the same claim, or had essentially the same thing/problem arise, or reached the same conclusion. I also suggest that you should look for counter-examples.

During the meeting, I might make this same point and then hunt for more experience reports, probably in parallel with the student who is on their own computer while I run the search on mine. This type of attack on the data foundation of the metric is a bit familiar to these students because we use Bossavit’s book The Leprechauns of Software Engineering: How folklore turns into fact and what to do about it, as one of our texts.

In the essay, all students have performance problems (that is, all students have problems with the generic task of writing essays and the generic task of finding information) and so (in my extensive experience — 1 course), the discussion with all students switches back and forth between the content (the analysis of the metric and of the research data) and the generic parts of the assignment.

A Few More Thoughts

You can use interactive grading with anything that you can review. For example, Keith Gallagher uses interactive grading in his programming courses, reading the student’s code with them and asking questions about the architecture, style, syntax, and ability to actually provide the benefits the program is supposed to provide. His students have spoken enthusiastically about this to me.

Dr. Gallagher ends his sessions a little differently than I do. Rather than telling students what their grade is, he asks them what they think their grade should be. In the discussion that follows, he requires the student to view the work from his perspective, justifying their evaluation in terms of his grading structure for the work. It takes some skill to pull this off, but done well, it can give a student who is disappointed with their grade even more insight into what needs improvement.

Done with a little less skill, interactive grading can easily become an interaction that is uncomfortable or unpleasant for the student. It can feel to a student who did poorly as if the instructor is bullying them. It can be an interaction in which the instructor does all the talking. If you want to achieve your objectives, you have to intentionally manage the tone of the meeting in a way that serves your objectives.

Overall, I’m pleased with my decision to introduce interactive grading to my classes. As far as I can tell, interactive grading takes about the same amount of time as more traditional grading approaches. There are several benefits to students. Those with writing or language problems have the opportunity to demonstrate more skill or knowledge than their written work might otherwise suggest. They get highly personalized, interactive coaching that seems to help them submit better work on future assignments. Most students like interactive grading and consider it a worthwhile use of their time. Finally, I think it is important to have students feel they are getting a good value for their tuition dollar. This level of constructive personal attention contributes to that feeling.

This post is partially based on work supported by NSF research grant CCLI-0717613 ―Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing. Any opinions, findings and conclusions or recommendations expressed in this post are those of the author and do not necessarily reflect the views of the National Science Foundation.

WTST 2013 Call for Participation: Teaching High Volume Automated Testing (HiVAT)

October 30th, 2012

Hello,

This is the WTST 2013 Call for Participation. Do you know someone who has been working in this area? If so, please pass this information along to them.

Thanks!

==============
12th WORKSHOP ON TEACHING SOFTWARE TESTING (WTST 2013)
JANUARY 25-27, 2013
MELBOURNE, FLORIDA
at the HARRIS INSTITUTE FOR ASSURED INFORMATION

 

WTST CALL FOR PARTICIPATION

TEACHING HIGH VOLUME AUTOMATED TESTING (HiVAT)

The Workshop on Teaching Software Testing is concerned with the practical aspects of teaching university-caliber software testing courses to academic or commercial students.

WTST 2013 is focused on high volume automated testing (HiVAT). Our goal is to bring together instructors who have experience teaching high-volume techniques or who have given serious consideration to how to teach these techniques. We also welcome participants focused on the teaching of complex cognitive concepts and the transfer of what was learned to industrial practice.

As at all WTST workshops, we reserve some seats for senior students who are strongly interested in teaching and for faculty who are starting their careers in this area or beginning a research program connected with teaching this type of material.

There is no fee to attend this meeting. You pay for your seat through the value of your participation. Participation in the workshop is by invitation based on a proposal. We expect to accept 15 participants with an absolute upper bound of 25.

BACKGROUND ON THE WORKSHOP TOPIC

High volume automated testing involves automated generation, execution and evaluation of the results of a large set of tests. This contrasts with more traditional “automated” testing that involves automated execution of a relatively small number of human-created tests.

Here are four examples of the types of problems that underlie the need for HiVAT:

  1. Many types of code weakness (such as timing-related problems) yield intermittent failures and are hard to detect with traditional testing techniques.
  2. There are immense numbers of possible combination tests of several variables together. Some combinations are unique (a failure appears only with a particular combination).
  3. Some failures occur primarily when a system under test is under load and so detection and characterization is essentially a statistical challenge.
  4. Characterizing the reliability of software requires a statistically useful set of tests.

In the academic community, the most commonly discussed HiVAT family of techniques is called “fuzzing.” But fuzzing, as commonly practiced, involves a very simplistic evaluation of the test results: essentially, run the software until it crashes or fails in some other very obvious way. Other HiVAT techniques rely on more powerful oracles and can therefore find other kinds of bugs.
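To make the contrast concrete, here is a minimal sketch of an oracle-based HiVAT run. The function under test, its deliberately seeded bug, and the reference oracle are all hypothetical, invented for illustration; the point is only that a real oracle lets the harness detect wrong answers, not just crashes.

```python
# Minimal high-volume testing sketch: generate many random inputs, run the
# function under test, and compare each result against a trusted oracle.
import random

def function_under_test(x):
    # Hypothetical doubling routine with a seeded bug for negative odd inputs.
    if x < 0 and x % 2 != 0:
        return x  # bug: forgets to double
    return x * 2

def oracle(x):
    # Trusted reference implementation used to judge each result.
    return x * 2

def hivat_run(trials=10_000, seed=42):
    rng = random.Random(seed)  # seeded so failures are reproducible
    failures = []
    for _ in range(trials):
        x = rng.randint(-1000, 1000)
        if function_under_test(x) != oracle(x):
            failures.append(x)
    return failures

failures = hivat_run()
print(len(failures), "failing inputs out of 10000")
```

A crash-only fuzzer would report nothing here, because the buggy function never crashes; the oracle comparison is what surfaces the failures, and the saved failing inputs support later characterization of the bug.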

WTST is about teaching testing, not creating new techniques. The challenge we are trying to address in this WTST is that many of these techniques are known but not widely applied. We believe this is because they are poorly taught. As far as we can tell, most testing courses don’t even mention these techniques. Of those that do (and go beyond fuzzing), our impression is that students come out baffled about how to actually DO that type of testing in their work.

At Florida Tech, we’re trying to address this by creating “reference implementations” for several techniques—open source demonstrations of them, with commentary on the design and implementation. We’re hoping that WTST will provide examples of other good approaches.

TO ATTEND AS A PRESENTER

Please send a proposal BY DECEMBER 1, 2012 to Cem Kaner <kaner@cs.fit.edu> that identifies who you are, what your background is, what you would like to present, how long the presentation will take, any special equipment needs, and what written materials you will provide. Along with traditional presentations, we will gladly consider proposed activities and interactive demonstrations.

We will begin reviewing proposals immediately. We encourage early submissions. It is unlikely but possible that we will have accepted a full set of presentation proposals by December 1.

Proposals should be between two and four pages long, in PDF format. We will post accepted proposals to http://www.wtst.org.

We review proposals in terms of their contribution to knowledge of HOW TO TEACH software testing. Proposals that present a purely theoretical advance in software testing, with weak ties to teaching and application, will not be accepted. Presentations that reiterate materials you have presented elsewhere might be welcome, but it is imperative that you identify the publication history of such work.

By submitting your proposal, you agree that, if we accept your proposal, you will submit a scholarly paper for discussion at the workshop by January 8, 2013. Workshop papers may be of any length and follow any standard scholarly style. We will post these at http://www.wtst.org as they are received, for workshop participants to review before the workshop.

TO ATTEND AS A NON-PRESENTING PARTICIPANT:

Please send a message by DECEMBER 1, 2012, to Cem Kaner <kaner@cs.fit.edu> that describes your background and interest in teaching software testing. What skills or knowledge do you bring to the meeting that would be of interest to the other participants?

The hosts of the meeting are:

Cem Kaner (https://kaner.com and http://www.testingeducation.org)
Rebecca Fiedler (http://bbst.info)
Michael Kelly (http://www.developertown.com/)

HOW THE MEETING WILL WORK

WTST is a workshop, not a typical conference. It is a peer conference in the tradition of The Los Altos Workshops on Software Testing (http://lawst.com). Our presentations serve to drive discussion. The target readers of workshop papers are the other participants, not archival readers. We are glad to start from already-published papers, if they are presented by the author and they would serve as a strong focus for valuable discussion.

In a typical presentation, the presenter speaks 10 to 90 minutes, followed by discussion. There is no fixed time for discussion. Past sessions’ discussions have run from 1 minute to 4 hours. During the discussion, a participant might ask the presenter simple or detailed questions, describe consistent or contrary experiences or data, present a different approach to the same problem, or (respectfully and collegially) argue with the presenter. In 20 hours of formal sessions, we expect to cover six to eight presentations. Some of our sessions will be activities, such as brainstorming sessions, collaborative searching for information, creating examples, evaluating ideas or work products. We also have lightning presentations, time-limited to 5 minutes (plus discussion). These are fun and they often stimulate extended discussions over lunch and at night.

Presenters must provide materials that they share with the workshop under a Creative Commons license, allowing reuse by other teachers. Such materials will be posted at http://www.wtst.org.

Our agenda will evolve during the workshop. If we start making significant progress on something, we are likely to stick with it even if that means cutting or time boxing some other activities or presentations.

LOCATION AND TRAVEL INFORMATION

We will hold the meetings at

Harris Center for Assured Information, Room 327

Florida Institute of Technology

150 W University Blvd

Melbourne, FL 32901

Airport

Melbourne International Airport is 3 miles from the hotel and the meeting site. It is served by Delta Airlines and US Airways. Alternatively, the Orlando International Airport offers more flights and more non-stops but is 65 miles from the meeting location.

Hotel

We recommend the Courtyard by Marriott – West Melbourne located at 2101 W. New Haven Avenue in Melbourne, FL.

Please call 1-800-321-2211 or 321-724-6400 to book your room by December 24, 2012. Be sure to ask for the special WTST rate of $93 per night. Tax is an additional 11%. All reservations must be guaranteed with a credit card by December 24, 2012. If rooms are not reserved by that date, they will be released for general sale, and subsequent reservations can only be made based upon availability.

For additional hotel information, please visit the hotel website at http://www.marriott.com/hotels/travel/mlbch-courtyard-melbourne-west/

Don’t censure people for disagreeing with us

October 15th, 2012

I just posted “Censure people for disagreeing with us?” to context-driven-testing.com.

I don’t usually cross-reference posts on that blog, but I feel pretty strongly about this….

Nominated for the “Software Test Luminary” award

September 29th, 2012

Every year, the Software Test Professionals Conference presents its “Software Test Luminary” award. I’m one of the three nominees for 2012.

Here’s a link to the announcement. If you think I should win the award, please vote for me. (Deadline: October 11, 2012).

BBST is now a registered trademark

September 20th, 2012

Kaner, Fiedler & Associates has completed the process of registering BBST as our trademark. We will be revising the course slides soon to show BBST ® (the R in the circle) and to include an acknowledgment in the acknowledgment paragraph on the front slide of each course. Please acknowledge the trademark when you publicize or write about the BBST courses. The proper notice is “BBST is a registered trademark of Kaner, Fiedler & Associates, LLC.”

We decided to register the trademark a long time ago, when we saw a few training companies marketing “BBST training” even though they didn’t have any connection with us and, as far as we could tell, their principals had not taken BBST courses. That was reinforced as people we didn’t know approached us with marketing ideas that seemed highly inappropriate. Registration protects “BBST” from being hijacked for other purposes. Registration has finally been completed and so it is time to start applying it.

The BBST courses, including the instructor’s course and the instructor’s manual continue to be Creative Commons-licensed. Anyone can teach those materials. Anyone can advertise that they teach those materials (as long as those are the materials they actually teach). Anyone can modify the course as they see fit and advertise that they teach a course that is partially based on those materials.  Nothing in the registration of BBST limits any of these things. It merely protects “BBST” from inappropriate use (i.e. use that goes beyond these extremely broad limits).

If you have specific questions about the BBST trademark, please ask.

Instructor’s Manual for the BBST Courses

September 13th, 2012

 

The BBST course series is open source — anyone can download the materials. Anyone can teach the courses.

Dr. Rebecca L. Fiedler designed the BBST Instructors Course to teach people who had taken BBST how to teach it. Over several years, we’ve been working on an Instructors’ Manual to support the course, publishing several drafts for review.

Dr. Fiedler, Doug Hoffman and I have finally finished the manual, a 357-page book. You can find it here, at the web page for the BBST Instructors Course. (Like BBST, the BBST instructors course materials are available to the world for free. Enjoy!)

As with all of the BBST work, we thank the National Science Foundation, for its support: grants EIA-0113539 ITR/SY+PE Improving the Education of Software Testers and CCLI-0717613 Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing. (The views expressed in BBST and this blog reflect the opinions of the authors and not NSF.)

The Oracle Problem and the Teaching of Software Testing

September 11th, 2012

When I first studied testing, I learned that a test involved comparison of the test result to an expected result. The expected result was the oracle: the thing that would tell you whether the program passed or failed the test. We generalized this a bit, especially for automated testing–we would look to a reference program as our oracle. A reference program is a program that generates the expected results, so the oracle is the generator rather than the results themselves. But the idea was the same: testing involved comparison with a known result.

The Oracle Problem

Oracle is an interesting choice of terminology, because the oracles of Greece (the original “oracles”) were mythological. And Greek tragedies are full of stories of people who misinterpreted what an oracle told them, and behaved (on the basis of their understanding) in ways that brought disaster on them.

If we define a software testing oracle as a tool that tells you whether the program passed your test, we are describing a myth–something that doesn’t exist. Relying on the oracle, you might make either of the classic mistakes of decision theory:

  • The miss: you believe the program has passed even though it did something wrong.
  • The false alarm: you believe the program has failed even though it has behaved appropriately.

So we soften the definition: a software testing oracle is a tool that helps you decide whether the program passed your test.

Seen this way, oracles are heuristic devices: they are useful tools that help us make decisions, but sometimes they point us to the wrong decision.

If you don’t have authoritative oracles (“authoritative” = an oracle that is always correct), then how can you test? How can you specify a test in a way that a junior tester or a computer can run the test and correctly tell you whether the program passed it?

The Instructional Problem

I’ve been emphasizing the oracle problem in my testing courses for about a dozen years. I see this as one of the defining problems of software testing: one of the reasons that skilled testing is a complex cognitive activity rather than a routine activity. Most of the time, I start my courses with a survey of the fundamental challenges of software testing, including an extended discussion of the oracle problem.

If you’ve seen the BBST-Foundations courses (for example, the Association for Software Testing teaches a version of this course), you’ve seen my introduction to the oracle problem and to heuristic oracles. Students typically work through one or more theoretical and/or practical labs in BBST. Once they understand the oracle problem, the course presents two approaches for using oracles:

  • One approach, that I associate with James Bach and Michael Bolton, lists 8 general types of expectations. For example, we expect a product to operate consistently across all of its features. We expect it to operate consistently from version to version, etc. (I’ll list the set of consistencies later.)
  • A different approach, that I associate with Doug Hoffman (but I think it’s been independently developed and followed by lots of people), lists specific heuristics. None of them is complete. Each focuses on a specific prediction about the results of a test and ignores other aspects of the test. For example, if we are testing a program that does calculations, checking whether it says 2+3=5 (5 is the expectation) is not complete, but it is useful. Testing whether we can invert an operation (take the square root of the square of a number) isn’t a complete test of a square-a-number function, but it is useful. (More examples below…)
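A partial oracle of this kind is easy to sketch in code. Here is a minimal illustration of the inversion idea for a square-a-number function; the function and its names are hypothetical, not from the course materials.

```python
import math

def square(x):
    """Stand-in for the (hypothetical) square-a-number function under test."""
    return x * x

def inversion_check(x):
    # Partial oracle: the square root of the square of x should
    # (approximately) give back abs(x). Passing this check does not
    # prove the result correct -- the oracle is useful, not complete.
    return math.isclose(math.sqrt(square(x)), abs(x))

print(inversion_check(1.5))  # True: consistent with the inversion oracle
```

Note that a buggy implementation that returned, say, x * x * 1.0000001 in some cases might still slip past this check; that is exactly the incompleteness the post describes.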

Bach and Bolton have done a good job of explaining their approach. Mike Kelly provides a great summary of it. In my lectures, I follow an explanatory approach that I learned from an early version of Bach’s RST course, and it works well. Students understand the consistencies and (generally) find them compelling. The list feels complete–any specific oracle you can think of can be classified as an example of one of their consistencies.

I think there are three problems with Bach and Bolton’s consistencies:

  1. These provide a useful way to think about a bug after you find it and try to report it. The list of consistencies can structure your thinking as you try to figure out how to explain to someone else why a particular program behavior feels wrong (what feels wrong about it?). However, even though I do find them helpful for evaluating test results, I don’t find them helpful for designing tests.
  2. I think they are particularly worthless for designing automated tests. Automated testing depends on oracles–the automated-testing-program that runs a zillion tests has to decide whether the software under test passed each test or not. The consistencies don’t guide testers (not me, not the testers I know, not the students I teach) toward oracle ideas that are specific enough to be programmed and used by an automaton.
  3. In my courses, the consistencies capture my students’ imagination and interfere with their thinking about oracles that would be useful for designing tests, especially automated tests.

It is the third problem that I have been wrestling with for several years and that will cause me to rewrite the BBST-Foundations course.

In exam after exam, when I give students a specific scenario that clearly involves automated testing and ask them to suggest oracles they would use to support their automated testing and how they would use them, they ramble through a memorized list of consistency heuristics and don’t come up with ideas–some students don’t come up with any ideas–for oracles that support the automation.

  • I tried to correct this by raising the issue in supplementary lectures. It didn’t work.
  • I went even further, telling students that this was a classic problem in this course and they needed to answer questions about oracles in test design with specific oracles. It didn’t work.
  • I went even further, telling students as part of the exam question itself that this question called for specific ideas about the oracles they would design into specific tests and they shouldn’t rely on general descriptions of consistencies. It didn’t work.
  • Even when I give a set of exam questions to students in advance, and they draft an answer in advance with full benefit of time, course notes, lecture notes and videos, discussions with each other, and anything they can find on the web–even when these questions have cautionary notes about this being a question about test automation and they shouldn’t just present a consistency oracle–it still didn’t work.

They just keep giving back worthless ideas about test design and flunk that part of the exam.

Feh!

An Important Instructional Heuristic

When a few students give bad answers on an exam, the problem is in the students. They don’t understand the material well enough.

When a lot of students give bad answers on the exam, the problem is in the instruction. It’s the responsibility of the teacher to troubleshoot and fix this.

When a lot of students give bad answers that are weak in a consistent way, something specific in the instruction leads them down that path. In my experience, that something is often something the instructor is particularly attached to.

I like the consistencies a lot. But I think that in the Foundations course, they are an attractive nuisance (an almost-irresistible invitation to take a hazardous or counterproductive path).

Therefore

  • In the next generation of the Foundations course, I will probably drop the oracle consistencies approach altogether.
  • In the next generation of the Bug Advocacy course, I will probably add the oracle consistencies as a useful tool for persuasive bug report writing.

Appendix: More Details on Oracles

  1. The underlying problem: Oracles are necessarily incomplete
  2. Oracles are heuristics
  3. Bach and Bolton’s consistencies
  4. Hoffman’s approach
  5. Applying this to test automation

Oracles are Necessarily Incomplete

Back when dinosaurs roamed the earth, some testasauruses theorized that a properly designed software test involves:

  • a set of preconditions that specify the state of the software and system when you start the test
  • a set of procedures that specify what you do when you do the test
  • comparison of what the software under test does with a set of postconditions: the predicted state of the system under test after you run the test. This set of postconditions makes up the expected results of the test.

We can call the postconditions the oracle or we can say that a program that generates the expected results is the oracle, but in either case, the testasauruses said, good testing involves comparing the program’s test behavior to expected results, and to do good testing, you need an oracle. (Fossils from this era have been preserved in IEEE Standard 829 on software test documentation.)
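The testasaurus view can be sketched as a small data structure. This shape is my own illustration of the precondition/procedure/postcondition idea, not anything taken from IEEE Standard 829.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class ClassicalTest:
    preconditions: Dict[str, Any]                          # required starting state
    procedure: Callable[[Dict[str, Any]], Dict[str, Any]]  # the steps of the test
    postconditions: Dict[str, Any]                         # predicted ending state: the "oracle"

    def run(self) -> bool:
        # "Pass" means the observed state matches every predicted postcondition.
        actual = self.procedure(self.preconditions)
        return all(actual.get(key) == value
                   for key, value in self.postconditions.items())

# A toy test: adding 2 + 2.
test = ClassicalTest(
    preconditions={"a": 2, "b": 2},
    procedure=lambda state: {"sum": state["a"] + state["b"]},
    postconditions={"sum": 4},
)
print(test.run())  # True
```

As the rest of the post argues, the weakness lies in the postconditions: no such dictionary can capture the full state of a real system.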

Elaine Weyuker’s (1980) On Testing Nontestable Programs shattered that view. Weyuker argued that “it is unusual for … an oracle to be pragmatically attainable or even to exist” (p. 3). Instead, she said, testers rely on partial oracles. For example:

  • A tester might recognize a result of a calculation as impossibly large even though she doesn’t know what the exact result should be. (You might not know offhand what 1.465732 x 2.74312 is, but if a program said 7,000,000 you could reject that as obviously wrong without doing any calculations.)
  • A tester might recognize behavior as inappropriate, even if she doesn’t know exactly how the program should behave.
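Weyuker’s first example can be expressed as a simple partial oracle: without computing 1.465732 x 2.74312 exactly, we can still reject 7,000,000. The bounding rule below is my own illustration.

```python
import math

def product_range_oracle(a, b, claimed):
    # Partial oracle for multiplying two positive numbers: the true
    # product must lie between floor(a)*floor(b) and ceil(a)*ceil(b).
    # It cannot confirm an exact answer, but it can reject absurd ones.
    low = math.floor(a) * math.floor(b)
    high = math.ceil(a) * math.ceil(b)
    return low <= claimed <= high

print(product_range_oracle(1.465732, 2.74312, 7_000_000))  # False: obviously wrong
print(product_range_oracle(1.465732, 2.74312, 4.02))       # True: plausible, not proven correct
```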

Weyuker’s paper wasn’t widely noticed in the practitioner community. I don’t think we appreciated the extent of this problem until the Quality Week conference in 1998, when Doug Hoffman (A Taxonomy for Test Oracles) explained this problem and its implications this way:

Suppose that we specify a test by describing

    • the starting state of the system under test
    • the test inputs (the data and operations you use to carry out the test)
    • the expected test outputs

We can still make mistakes in interpreting the test results.

    • We might incorrectly decide that the program passed the test because its outputs matched the expected outputs but it misbehaved in some other way. For example, a program that adds 2+2 might get 4, but it is clearly broken in some way if it takes 10 hours to get that result of 4.
    • We might incorrectly decide that the program failed the test because its outputs did not match the expected results, but on more careful examination, we might realize that it did the right thing. For example, imagine sending a test document to a network printer with the expectation that the printer will print a specific page within 1 minute–but during the test, another computer sent a long document to the printer, so the printer didn’t actually get to the test document for a long time. This might be exactly the correct behavior under the circumstances, but it doesn’t match the expectation.

Most testers, doing manual testing, would probably not make either mistake. But an automated test would make both mistakes. So would a manual tester who was trying to exactly follow a fully-detailed script.

Doug argued that both types of mistakes were inevitable in testing because no one could fully specify the starting state of the system and no one could fully specify the ending state of the system. There are too many potentially-relevant variables. For example, suppose in your 2+2 test, you do specify the expected time for the test to complete:

  • Did you specify the contents of the stack? What if the program adds stuff to the stack but doesn’t remove it, or corrupts the stack in some other way?
  • Did you specify the contents of memory? Memory leaks are common bugs. And buffer overflows are a common example of a class of bug that corrupts memory.
  • Did you specify the contents of the hard disk? What if the program saves something or deletes something?
  • Did you specify what the printer would do during the test? What if the program sends something to the printer, even though it is not supposed to, or sends unauthorized email, etc.?

If you don’t have experience thinking about the diversity of ways that something can go wrong, but you have a bit of technical savvy, the Hewlett-Packard printer diagnostics can be eye-opening. You can find documentation of these in the Management Information Bases (MIBs) published by HP. I find these at https://spp.austin.hp.com/SPP/Public/Sdk/SdkPublicDownload.aspx, but if this source goes away, you can find them on third-party sites like OiDView. For example, the MIB file for the LaserJet 9250c runs 8506 lines, documenting 176 commands, many of them with many possible parameters. A program can go wrong in hundreds (or thousands) of different ways.

From a diagnostic point of view, imagine running a test and checking the state of the printer. For example, you might check how much free memory there is, or how long the last command took to execute, or the most recent internal error code. Each diagnostic command that you run changes the state of the machine, and so the results of the next diagnostic are no longer looking at the system as it was right after the test completed.

So in practical terms, even if you could fully specify the state of the system after a test (you can’t, but pretend that you could), you still couldn’t check whether the system actually reached that state, because each of the diagnostics that you would run to check the state of the system would change that state. The next diagnostic tests a machine that is now in a different state. You can only run a few diagnostics as part of a test (maybe just one) before the diagnostics stop being informative. If these diagnostics don’t look for a problem in the right places, you won’t see it. This is sometimes called the Heisenbug problem, in honor of the Heisenberg Uncertainty Principle.

Oracles are Heuristics

Hoffman argued that no oracle can fully specify the postcondition state of the system under test and therefore no oracle is complete. Given that an oracle is incomplete, you might use the oracle and incorrectly conclude that the program failed the test when it didn’t or passed the test when it didn’t. Either way, reliance on an oracle can lead you to the wrong conclusion.

A decision rule that is useful but not always correct is called a heuristic.

My favorite presentations of the ideas underlying heuristics were written by Billy V. Koen. See his book (I prefer the shorter and simpler ASEE early edition used in introductory engineering courses, but the current version is good too) and a wonderful historical article that he wrote for BBST.

The Bach / Bolton Consistency Heuristics

Imagine running a test. The program misbehaves. The tester notices the behavior and recognizes that something is wrong. What is it that makes the tester decide this is wrong behavior?

In Bach’s view (as I understand it from talking with him and teaching about this with him), what happens is that the tester makes a comparison between the behavior and some expectations about the ways the program should (or should not) behave. These comparisons might be conscious or unconscious, but Bach posits that they must happen because every explanation of why a program’s behavior has been evaluated as a misbehavior can be mapped to one of these types of consistency.

Here’s the list:

  • Consistent within product: Function behavior consistent with behavior of comparable functions or functional patterns within the product.
  • Consistent with comparable products: Function behavior consistent with that of similar functions in comparable products.
  • Consistent with history: Present behavior consistent with past behavior.
  • Consistent with our image: Behavior consistent with an image the organization wants to project.
  • Consistent with claims: Behavior consistent with documentation, specifications, or ads.
  • Consistent with standards or regulations: Behavior consistent with externally-imposed requirements.
  • Consistent with user’s expectations: Behavior consistent with what we think users want.
  • Consistent with purpose: Behavior consistent with product or function’s apparent purpose.

(If you reorder the list, you can use a mnemonic abbreviation to memorize it: HICCUPPS.)

For example, imagine that there is a program specification and that the program behaves differently from what you would predict from the specification. The behavior might be reasonable, but if it contradicts the specification, you should probably write a bug report. Your explanation of the problem in the report wouldn’t be “this is bad”. It would be “this is bad because it is inconsistent with the specification.”

The list is designed to cover every type of consistency-expectation that testers rely on. When Bach and Bolton realize the list is incomplete, they add a new type.

For the sake of argument, I will assume that this list is complete, i.e. that every rationale that a tester provides for why a program is misbehaving can be mapped to one of these 8 types of consistency.

I have seen it argued (mainly on Twitter) that this is the “right” list: that every other oracle can be mapped to it (this oracle tests for this type of inconsistency) and therefore all other oracles are special cases. If you know this list, the argument goes, you can derive (or imagine) (or something) all the oracles from it.

As far as I know, there is no empirical research to support the claim that testers in fact always rely on comparisons to expectations or that these particular categories of expectations map to the comparisons that go on in testers’ heads.

  • That assertion does not match my subjective impression of what happens in my head when I test. It seems to me that misbehaviors often strike me as obvious without any reference to an alternative expectation. One could counter this by saying that the comparison is implicit (unconscious) and maybe it is. But there is no empirical evidence of this, and until there is, I get to group the assertion with Santa Claus and the Tooth Fairy. Interesting, useful, but not necessarily true.
  • The assertion also does not match my biases about the nature of concept formation and categorical reasoning. As a graduate student, I studied cognition with Professor Lee R. Brooks. Some of his most famous work was on nonanalytic concept formation (see his 1978 chapter in Rosch & Lloyd’s classic Cognition & Categorization, or his 1984 paper with Larry Jacoby on nonanalytic cognition). A traditional view of cognition holds that we make many types of judgment on the basis of rules that put things into categories–something is this or that because of a set of rules that we consult either consciously or unconsciously. Bach and Bolton’s consistencies are examples of the kinds of categories that I think of when I think of this tradition. A very different view holds that we make judgments on the basis of similarity to exemplars. (An exemplar is a memorable example.) A person can learn arbitrarily many exemplars. Experts have probably learned many more than nonexperts, and so they make better evaluations. One of the most interesting experiments in Lee’s lab required the experimental subject to make a judgment (saying which category something belonged to) and explain the judgment. The subjects described what they said were their decision rules for each choice. But over a long series of decisions, you can ask whether these rules actually describe the judgments being made. The answer was negative: a subject would describe a rule that he hadn’t recently followed and that he would again not follow later. Instead, the more accurate predictor of his decisions was the similarity of the thing he was categorizing to other things he had previously categorized. It appeared that unconscious processing was going on, but it was nonanalytic (similarity-based), not analytic (rule-based). I found, and still find, this line of results persuasive.

A list can be useful as a heuristic device, as a tool that helps you consciously think about a problem, whether the list describes the actual underlying psychology of testing or not.

But if it is to be a good heuristic device, it has to be more useful than not. As a tool for teaching oracles as part of test design, my experience is that the consistency list fails the utility criterion.

I don’t have any scientific research to back up my conclusion, just a lot of personal experience. But when dealing with a heuristic device that is not backed up by any scientific research (just a lot of personal experience), I get to rely on what I’ve got.

Doug Hoffman’s Approach

I first saw Doug talk about oracles in 1998, at Quality Week. That was the start of a long series of publications on oracles and the use of oracles in test automation. Along with the papers, I have the benefit of having taught courses on test automation with Doug and having talked at length with him while he struggled to get his ideas on paper.

Doug made two key points in 1998:

  • All oracles are heuristic (we’ve already covered that ground)
  • There are a lot of incomplete oracles available. Given that we have to rely on incomplete oracles (because no oracles are complete), we should think about what combinations of oracles we can use to learn interesting things about the software.

Doug’s work was so striking that we opened the Fifth Los Altos Workshop on Software Testing with it. That meeting became an intense, 2-day long, moderated debate between Doug and James Bach. We learned so much about managing difficult debates in that meeting that we were able to create what I think of as the current structure of LAWST, adopted in LAWST 6.

Doug has published several lists of specific types of oracles. Unfortunately, each of the ones I’ve read has its own idiosyncrasies that can be confusing, so I won’t try to restate them. Instead, I’ll work from BBST-Foundations-2013 (in preparation), which refines a list that I prepared for the current BBST Foundations with Doug’s coaching.

  • We use the constraint oracle to check for impossible values or impossible relationships. For example, an American ZIP code must be 5 or 9 digits. If you see something that is non-numeric or some other number of digits, it cannot be a ZIP code. A program that produces such a thing as a ZIP code has a bug.
  • We use the regression oracle to check results of the current test against results of execution of the same test on a previous version of the product.
  • We use self-verifying data as an oracle. In this case, we embed the correct answer in the test data. For example, if a protocol specifies that when a program sends a message to another program, the other one will return a specific response (or one of a few possible responses), the test could include the acceptable responses. An automated test would generate the message, then check whether the response was in the list or was the specific one in the list that is expected for this message under this circumstance.
  • We use a physical model as an oracle when we test a software simulation of a physical process. For example, does the movement of a character or object in a game violate the laws of gravity?
  • We use a business model the same way we use a physical model. If we have a model of a system, we can predict what will happen when a given event takes place. If the software emulates the business process as we intend, it should give us behavior that is consistent with those predictions. Of course, as with all heuristics, if the program “fails” the test, it might be the model that is wrong.
  • We use a statistical model to tell us that a certain behavior or sequence of behaviors is very unlikely, or very unlikely in response to a specific action. The behavior is not impossible, but it is suspicious. We can test whether the actual behavior in the test is within the tolerance limits predicted by the model. This is often useful for looking for patterns in larger sets of data (longer sequences of tests). For example, suppose we expect an eCommerce website to get 80% of its customers from the local area, but in beta trials of its customer-analysis software, the software reports that 70% of the transactions that day were from far away. Maybe this was a special day, but probably this software has a bug. If we can predict a statistical pattern (correlations among variables, for example), we can check for it.
  • Another type of statistical oracle starts with an input stream that has known statistical characteristics and then checks the output stream to see if it has the same characteristics. For example, send a stream of random packets, compute statistics of the set, and then have the target system send back the statistics of the data it received. If this is a large data set, this can save a lot of transmission time. Testing transmission using checksums is an example of this approach. (Of course, if a message has a checksum built into the message, that is self-verifying data.)
  • We use a state model to specify what the program does in response to an input that happens when it is in a known state. A full state model specifies, for every state the program can be in, how the program will respond (what state it will transition to) for every input.
  • We can build an interaction model to help us test the interaction between this program and another one. The model specifies how that program will behave in response to events in (actions of) this program and how this program will behave in response to actions of the other program. The automaton triggers the action, then checks for the expected behavior.
  • We use calculation oracles to check the calculations of a program. For example, if the program adds 5 numbers, we can use some other program to add the 5 numbers and see what we get. Or we can add the numbers and then successively subtract one at a time to see if we get a zero.
  • The inverse oracle is often a special case of a calculation oracle (the square of the square root of 2 should be 2) but not always. For example, imagine taking a list that is sorted low to high, sorting it high to low and then sorting it low to high. Do we get back the same list?
  • The reference program generates the same responses to a set of inputs as the software under test. Of course, the behavior of the reference program will differ from the software under test in some ways (they would be identical in all ways only if they were the same program). For example, the time it takes to add 1000 numbers might be different in the reference program versus the software under test, but if they ultimately yield the same sum, we can say that the software under test passed the test.

You can probably imagine lots of other possibilities for this list.
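To make the idea of a programmable oracle concrete, here is a sketch of the constraint oracle for the ZIP-code example. The function name and the regular expression are my own illustration.

```python
import re

def zip_constraint_oracle(value: str) -> bool:
    # Constraint oracle: an American ZIP code must be 5 digits, or
    # 9 digits (usually written as 5-4). Anything else is impossible,
    # so a program that emits it as a ZIP code has a bug.
    return re.fullmatch(r"\d{5}(-?\d{4})?", value) is not None

print(zip_constraint_oracle("32901"))       # True  (possible ZIP code)
print(zip_constraint_oracle("32901-1234"))  # True  (possible ZIP code)
print(zip_constraint_oracle("3290A"))       # False (impossible: flag a bug)
```

An automated test could run this check against every ZIP code the program produces; it will never confirm that a ZIP code is the *right* one, only that it isn’t an impossible one.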

Applying This to Automation

What is special about these oracles is that they are programmable. You can create automated tests that will check the behavior of the program against the result predicted by (or predicted against by) any of these oracles or by any (well, probably, almost any) combination of these oracles.

Given a programmable oracle you can do high volume automated testing. Have the test-design-and-execution program randomly generate inputs to the software under test and check whether the software responds the way the oracle predicts. You might use some type of model to drive the random number generator (making some events more likely than others). You might create a long sequence of tests (e.g. regression tests) by randomly selecting which test to run next from a pool of already-built tests (each of which has an expected result that you can check against). Given an oracle, you can detect whatever failures that oracle can expose. For example, you might test with several oracles:

  • One oracle predicts how long an operation should take (or a range of possibilities). If the program takes substantially more or less time, that’s a problem.
  • Another oracle predicts the calculation result of the operation (or the functional result if you’re doing something else, like sorting, that isn’t exactly a calculation).
  • Another oracle might predict the amount of free memory, or at least might tell you whether a large data set (or memory-intensive calculation) should fit in memory. With an oracle like this, you can detect memory leaks.
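A high-volume harness that combines oracles of this kind might be sketched as follows. This is an illustration under stated assumptions, not a real tool: `software_under_test`, the one-second timing bound, and both oracles are hypothetical stand-ins:

```python
import random
import time

def software_under_test(numbers):
    # Hypothetical feature under test: summing a list.
    return sum(numbers)

def calculation_oracle(numbers, result):
    # Recompute the result independently.
    return result == sum(numbers)

def timing_oracle(elapsed, max_seconds=1.0):
    # Assumed bound: the operation should finish within max_seconds.
    return elapsed <= max_seconds

# Randomly generate inputs and check each response against both oracles.
failures = []
for test_number in range(10_000):
    numbers = [random.randint(-1000, 1000)
               for _ in range(random.randint(1, 50))]
    start = time.perf_counter()
    result = software_under_test(numbers)
    elapsed = time.perf_counter() - start
    if not calculation_oracle(numbers, result):
        failures.append((test_number, 'wrong result', numbers))
    if not timing_oracle(elapsed):
        failures.append((test_number, 'too slow', numbers))

print(len(failures))
```

Each failure record keeps the generated input, so a human can replay and investigate any test the oracles flagged.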

No matter what combination you choose, you will miss some types of errors. You cannot check every dimension of a test’s result with any oracle or any combination of oracles. But if you test a feature by machine using some oracles, then when a human painstakingly designs and runs each test of that feature individually, she will know that she doesn’t have to waste time checking for certain types of bugs, because if they were there, the automation would already have exposed them.

An Exam Question

Here’s an exam question (that students in the current version of BBST Foundations have often handled poorly):

Suppose you have written a test tool that allows you to feed commands and data to Microsoft Excel and to Open Office Calc and to see the results. The test tool is complete, and it works correctly. You have been asked to test a new version of Calc and told to automate all of your testing. What oracles would you use to help you find bugs? What types of information would you expect to get with each oracle?

Note: Don’t just echo back a consistency heuristic. Be specific in your description of a relevant oracle and of the types of information or bugs that you expect.

Look back at the Hoffman list and think of what oracles you could use for this test, to facilitate extensive automated testing.

Now look further back at the Bach / Bolton list and think of what oracles their list suggests that would work well for designing extensive automated testing.

For me, the Hoffman list works better. (And if I thought about additional oracles that were specific enough to support automation, I would add them to the Hoffman list, growing it into something that gets longer and longer the more I use it.) What about for you?
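To make the reference-program idea concrete without answering the exam question outright, here is a sketch of a reference oracle driving randomly generated inputs through two implementations. The names are hypothetical stand-ins: `calc_sum` plays the role of the feature under test (e.g. a Calc function driven by the tool) and `reference_sum` the reference program (e.g. the same function in Excel):

```python
import math
import random

def calc_sum(numbers):
    # Stand-in for the feature under test.
    return sum(numbers)

def reference_sum(numbers):
    # Independent reference implementation of the same calculation.
    return math.fsum(numbers)

# Feed the same randomly generated input to both implementations and
# compare their results within a floating-point tolerance.
for _ in range(1000):
    numbers = [random.uniform(-1e6, 1e6) for _ in range(20)]
    assert math.isclose(calc_sum(numbers), reference_sum(numbers),
                        rel_tol=1e-9, abs_tol=1e-6)
```

Note the tolerance: as the post says, a reference program will differ from the software under test in some ways, so the oracle must specify how close is close enough.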

This post is partially based on work supported by NSF research grant CCLI-0717613 “Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing.” Any opinions, findings and conclusions or recommendations expressed in this post are those of the author and do not necessarily reflect the views of the National Science Foundation.

 

CAST 2012 Metrics Talk Posted

August 7th, 2012


I’ve posted the video and slide deck for the metrics talk Nawwar and I did at CAST. I hope you enjoy them.

 

Theses and Dissertations About Software Testing

April 17th, 2012

This week, Florida Tech’s Center for Software Testing Education & Research (my lab) published a bibliography of dissertations and theses focused on software testing (http://www.zotero.org/groups/cster_dissertations). To use the bibliography, you need an open source (free) tool called Zotero (http://www.zotero.org/).

What are these documents?

Theses are research reports written by graduate students as a final requirement for graduation. Doctoral theses are also called dissertations.

The typical thesis includes:

  • a literature review that describes a significant set of related research published by others
  • an idea (for example, an idea about how to improve testing, or how to create and assess a testing tool, or how to study how testing is really done, or how to teach it more effectively, or how to prove that some other idea about testing is wrong)
  • a description of the thesis methodology and technology (for example, how a test tool was designed, implemented, and studied)
  • a description of the results of the study.

Theses are evaluated by professors who are experts in the discipline and at least one who is not. For example, the supervisory committee for a doctoral student in Computer Science might include three professors of Computer Science, one professor of Biology and one of Business Administration. Of the three computer scientists, one (or two) would typically be expert in the subject matter of the thesis (e.g. software testing) and the others would probably be experts in other areas of computing that the work depends on. For example, a dissertation focused on testing databases might be supervised by an expert in testing, an expert in databases and an expert in research design.

Why I recommend them

Theses are designed to be read by someone who is not an expert in the field. Therefore, a thesis will typically organize a testing problem, including the most relevant research papers, in a way that a student or a mid-level testing practitioner can understand.

Of course, theses vary in quality. Some are written poorly. Some are researched poorly. Many present half-baked ideas (this is student work, not the work of an experienced practitioner or a professional researcher). But overall, I have found them good starting points when I start working in a new area or when I assign a student to an area that is new to her.

What’s in the bibliography

We have over 700 references, most of them from before 2007.

Each reference includes the basic bibliographic information (author, title, etc.). It also includes:

  • a URL.
    • If we found a free copy of the thesis online, we point to that.
    • If not, then if the thesis is listed in WorldCat, we point to that. WorldCat indexes many of the world’s public libraries. If your public or university library is on the Interlibrary Loan system, WorldCat will tell your reference librarian what library has a copy of the thesis, so you can borrow it. Interlibrary Loans are often free to the borrower. It’s not as convenient as free-on-the-web, but it’s still free.
    • If it’s not listed on WorldCat, we point to ProQuest (we often point to ProQuest in the notes as well). You might know this branch of ProQuest as University Microfilms. You can order dissertations from ProQuest but this is not always free (PQDT Open and ProQuest with Google Scholar publish some dissertations for free, possibly including some that we thought were available only commercially). Prices vary. I think $39 is a typical number. Because theses are of variable quality, I strongly suggest that you preview as much as you can (you can often download a chapter for free from ProQuest) or read an article that summarizes the thesis (see next section) before paying $39.
  • a related reference
  • an abstract (a short summary of the thesis)
    • We chose not to copy abstracts from Dissertations Abstracts (ProQuest) because we don’t want to risk a copyright fight. If we found a copy of a thesis online or if an author posted a copy of their thesis abstract online, we copied that abstract into the bibliographic record for the thesis.

Digging up this extra information takes a lot of time and painstaking work. We’re continuing to add more recent work, and expect to grow the collection significantly over the summer.

Acknowledgments

This bibliography was created primarily by Karishma Bhatia, Casey Doran, Pat McGee, Kasey Powers, Andy Tinkham, and Patricia Terol Tolsa.

This bibliography is a product of research that was supported by NSF Grants EIA-0113539 ITR/SY+PE: “Improving the Education of Software Testers” and CCLI-0717613 “Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing.” Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.