
Updating to BBST 4.0: What Should Change

Friday, July 15th, 2016

This is the 4th section of my post on BBST 4.0.

What We Think Should Change

We are satisfied with the course standards and the supporting objectives (teaching students better learning skills). We want to make it easier for students to succeed by improving how we teach the course, but our underlying criteria for success aren’t going to change.

In contrast, we are not satisfied with the central content. Here are my notes:

1. Information objectives drive the testing mission and strategy

When you test a product, you have a defining objective. What are you trying to learn about the product? We’ll call this your information objective. Your mission is to achieve your information objective. So the information objective is more about what you want to know and the mission is more about what you will do. Your strategy describes your overall plan for actually doing/achieving it.

This is where we introduce context-driven testing. For example, sometimes testers will organize their work around bug-hunting, but other times they will organize around getting an exciting product into the market as quickly as possible and making it improvable as efficiently as possible. The missions differ here, and so do the strategies for achieving them.

Critique:

  • The distinction between information objective and mission is too fine-grained. It led to unhelpful discussions and questions. We’re going to merge the two concepts into “mission.”
  • The presentation of context got buried in a mass of other details. We gave students an assignment in which we described several contexts and asked them to discuss how testing would differ across them. This was too hard for too many students. We have to provide a better foundation in the lecture and reading.
  • We must present more contexts and/or characterize them more explicitly. Back when we created BBST 3.0, we treated some development approaches as ideas rather than as contexts. The world has moved on, and what was, to some degree, hypothetical or experimental, has become established. For example:
    • In BBST 3.0, when we described a testing role that would fit in an agile organization, we avoided agile-development terminology (for reasons that no longer matter). In retrospect, that decision was badly outdated when we made it. Several different contexts grew out of the Agile Manifesto. Books like Crispin & Gregory illustrate the culture of testing in those situations. In BBST 4.0, we will treat this as a normal context.
    • Similarly, a very rapid development process is common in which you ship an early version, polish aggressively based on customer feedback, and make ongoing incremental improvements rapidly. Whether we think this approach is wonderful or not, we should recognize it as a common context. A context-respecting tester who works for, or consults to, a company that develops software this way is going to have to figure out how to provide valuable testing services that fit within this model.
  • Context-driven testing isn’t for everyone. Context-driven testing requires the tester to adapt to the context—or leave. There are career implications here, but we didn’t talk much about them in BBST 3.0. The career issues were often visible in the classes, but there was no place in the course to work on them. This section (mission drives strategy) isn’t the place for them, but the course needs a place for career discussion.

Tentative decisions:

  • Continue with this as the course opener.
  • Merge information objectives and mission into one concept (mission) which drives strategy.
  • Tighten up on the set of definitions. Definitions / distinctions that aren’t directly applicable to the course’s main concepts (such as the distinction between black box testing and behavioral testing) must go away.
  • Create a separate, well-focused treatment of career paths in software testing. (This will become a supplementary video, probably presented in the lesson that addresses test automation.)

2. Oracles are heuristic

An oracle is a mechanism for determining whether a program passed or failed a test. In 1980, Elaine Weyuker pointed out that perfect oracles rarely exist and that in practice, we rely on partial oracles. Weyuker’s work wasn’t noticed by most of us in the testing community. The insight didn’t take hold until Doug Hoffman rediscovered it, presenting his Taxonomy of Test Oracles at Quality Week and then at an all-day session at the Los Altos Workshop on Software Testing (LAWST 5).

Foundations presented oracles from two conflicting perspectives: (a) Hoffman’s (and mine) and (b) James Bach & Michael Bolton’s.

  • We started with an example from Bach’s Rapid Software Testing course, then presented Bach and Bolton’s “heuristic oracles” (or “oracle heuristics”). A heuristic is a fallible but reasonable and useful decision rule. The heuristic aspect of a heuristic oracle is the idea that the behavior of software should usually (but not always) be consistent with a reasonable expectation. Bach developed a list of reasonable expectations, such as the expectation that the current version of the software will behave similarly to a previous version. This is usually correct, but sometimes it is wrong because of a design improvement. Thus it is a heuristic. After the explanation, students worked through a challenging group assignment.
  • Next, we presented Hoffman’s list of partial oracles and mentioned that these are useful supports for automated testing. We gave students some required readings, plus quiz questions and exam study guide questions, but no assignment. (A minimal sketch of one such partial oracle follows this list.)
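As a concrete illustration, here is the flavor of partial oracle that supports automated testing. This is my sketch, not Hoffman’s code; the function names and tolerances are hypothetical. The point is that the check is useful without knowing the “true” answer for every input, and that it is blind to some failures by design:

```python
import math

def sqrt_inverse_oracle(x, result, tol=1e-9):
    """Partial oracle for a square-root routine: squaring the result
    should approximately recover the input. This catches grossly wrong
    magnitudes, but it is blind to the sign of the result (-result
    would also pass), to performance problems, and to side effects."""
    return abs(result * result - x) <= tol * max(1.0, abs(x))

# The oracle lets us check many inputs without precomputing expected values.
for x in [0.0, 1.0, 2.0, 1e10, 1e-10]:
    result = math.sqrt(x)  # stand-in for the implementation under test
    assert sqrt_inverse_oracle(x, result), f"suspect result for {x}"
```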

Critique:

This material worked well for RST. It did not work well in BBST. I published a critique, The Oracle Problem and the Teaching of Software Testing, and a more detailed analysis in the Foundations of Software Testing workbook. Here are some of my concerns:

  • The terminology created at least as much confusion as insight. When we tested them, students repeatedly confused the meanings of the words “oracle” and “heuristic.” Many students treated the words as equivalent and made statements like “all heuristics are oracles” (which is, of course, not true).
  • The terminology is redundant and uninformative. Saying “heuristic oracle” is like saying “testly test.” The descriptor (heuristic, or testly) adds no new information. The word heuristic has been severely overused in software testing. A fallible decision rule is a heuristic. A cognitive tool that won’t always achieve the desired result is a heuristic. A choice to use a technological tool that won’t always achieve the desired result is a heuristic. Every decision testers make is rooted in heuristics because all of our decisions are made under uncertainty. Every tool we use can be called a heuristic because, given the impossibility of complete testing and the impossibility of perfect oracles, every tool we will ever use is imperfect. This isn’t unique to software testing. Long before we started talking about this in software testing, Billy V. Koen’s award-winning introductions to engineering (often taught in general engineering courses) pointed out that engineering reasoning and methods are rooted in heuristics. See Koen’s writings on the engineering method. There is nothing more heuristic-like about oracles than there is about any other aspect of test design, or of engineering in general.
  • Heuristic is a magic word. Magic words provide a name for something while relieving you of the need to think further about it. The core issue with oracles is not that they are fallible. The core issue is that they are incomplete.
    • The nature of incompleteness: An oracle will focus a test on one aspect of the software under test (or a few aspects, but just a few). Testers (and automated tests) will notice whether the program behaves well relative to those aspects, but won’t notice that it misbehaves in other ways. Therefore, you can notice a failure if it is one that you’re looking for, but you never know whether the program actually passed a test. You merely learn that it didn’t fail the test.
    • Human observers don’t eliminate this incompleteness. If you give a human observer a test with a well-defined oracle, they are likely to behave like a machine. They pay attention to what the oracle steers them to and don’t notice anything else. The phenomenon of not noticing anything else has been studied formally (inattentional blindness). If you don’t steer the observer with an oracle, you can get more diversity. If multiple people do the testing, different people will probably pay attention to different things, so across a group of people you will probably see greater coverage of the variety of risks. However, each individual is a finite-capacity processor. They have limited attention spans. They can only pay attention to a few things. When people split their attention, they become less effective at each task. An individual observer can introduce variation by paying attention to different potential problems in different runs of the same test.
    • I don’t think you achieve effective oracle diversity by choosing to “explore” (another magic word, though one that I am more fond of). I think you achieve it by deliberately focusing on some things today that you know you did not focus on yesterday. We can do that systematically by intentionally adopting different oracles at different times. Thinking of it this way, oracle diversity is at least as much a matter of disciplined test design as it is a basis for exploration. (I learned this, and its importance for the design of suites of automated tests, from Doug Hoffman.) A sketch of this kind of deliberate oracle rotation appears after this list.
    • No aspect of the word “heuristic” takes us into these details. The word is a distraction from the question: How can we compensate for inevitable incompleteness, (a) when we work with automated test execution and evaluation and (b) when we work with human observers?
  • There are two uses of oracles. We emphasized the one that is wrong for Foundations:
    1. The test-design oracle: An oracle is an essential part of the design of a test, or of a suite of tests. This is especially important in the design of automated tests. The oracle defines what problems the tests can see (and what problems they are blind to). Anyone who is learning how tests are designed should be learning about building suites of imperfect tests using partial oracles.
    2. The tester-expectations oracle: An oracle can be a useful component of a bug report. When you find a bug, you usually have to tell someone about it, and the telling often includes an explanation of why you think this behavior is wrong, and how important the wrongness is. For example, suppose you say, “this is wrong because it’s different from how it used to be.” You are relying on an expectation here and treating it as an oracle. The expectation is that the program’s behavior should stay the same unless the program is intentionally redesigned.

    James Bach originally developed his list of tester expectations as a catalog of patterns of explanations that testers gave when they were asked to explain their bug reports. He recharacterized them as oracle heuristics (or heuristic oracles) years later. We think these are useful patterns, and when we create Bug Advocacy 4.0, we will probably include the ideas there (but maybe without the fancy oracular heuristical vocabulary).

  • In Foundations, Bach and Bolton’s excellent presentation of tester-expectation oracles drowned out most students’ awareness of test-design oracles. That is, what we see on Foundations 3.0 exams is that many students forget, or completely ignore, the course’s treatment of partial oracles, remembering only the tester-expectation oracles.
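To make the disciplined-oracle-diversity point concrete, here is a minimal sketch (mine, not course material) of rotating deliberately chosen partial oracles across runs of the same automated tests. Everything here is hypothetical and simplified; the point is that each oracle’s blindness is documented, and the rotation makes today’s run look for what yesterday’s run could not see:

```python
import itertools
import math

def sign_oracle(x, result):
    """Blind to magnitude errors: only checks that the result is non-negative."""
    return result >= 0.0

def inverse_oracle(x, result, tol=1e-9):
    """Blind to sign errors: result and -result both pass the squaring check."""
    return abs(result * result - x) <= tol * max(1.0, abs(x))

# Rotate focus deliberately: each run of the suite adopts a different oracle.
ORACLES = itertools.cycle([sign_oracle, inverse_oracle])

def run_suite(inputs):
    oracle = next(ORACLES)  # today's focus differs from yesterday's
    return [x for x in inputs if not oracle(x, math.sqrt(x))]

print(run_suite([0.0, 1.0, 2.0, 9.0]))  # this run checks signs -> []
print(run_suite([0.0, 1.0, 2.0, 9.0]))  # this run checks inverses -> []
```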

Tentative decisions:

  • All of the “heuristic” discussion of oracles is going away from Foundations.
  • We will probably introduce the notion of heuristics somewhere in the BBST series. There is value in teaching people that testers’ reasoning and decisions are rooted in heuristics, and that this is normal in all applied sciences. However, we will teach it as its own topic. It creates a distraction when pasted onto oracles.
  • We will tie the idea of partial oracles much more tightly to the design of suites of automated tests. We will focus on the problem that all oracles are incomplete, then transition into the questions:
    • What are you looking for with this (test or) series of tests?
    • What are these tests blind to?
    • How will you look for those other potential problems?
    • Can you detect these potential problems with code?
    • If not, can you train people to look for them?
    • Which problems are low-priority enough, or low-likelihood enough, that you should choose to run general exploratory tests and rely on luck and hope to help you stumble across them if they are there?
  • Years ago, many test tool vendors (and many consultants) promised complete test automation from tools that were laughably inadequate for that purpose. These people promised “complete testing” that would replace all (or almost all) of your expensive manual testers. Several of us (including me) argued that these tools were weaker than promised. We pointed out that the tools provided a very incomplete look at the behavior of the software and even that often came with an excessively high maintenance cost. These discussions were often adversarial and unpleasant. Sometimes very unpleasant. Arguments like these leave mental scars. I think these old scars are at the root of some of the test-automation skepticism that we hear from some of our leading old-timers (and their students). I think BBST can present a much more constructive view of current automation-support technology by starting from the perspective of partial oracles.
  • Knott’s book on Mobile Application Testing provides an excellent overview of the variety of test-automation technologies, contrasting them in terms of the basis they use for evaluating the software (e.g. comparison to a captured screen) and the costs and benefits of that basis. These are all partial oracles. They come with their own costs, including tool cost, implementation difficulty, and maintenance cost, and they come with their own limitations (what you can notice with this type of oracle and what you are blind to). I think this presentation of Knott’s, combined with updated material from Hoffman, will probably become the core of the new oracle discussion. (A toy sketch of a captured-screen comparison oracle appears after this list.)
  • I think this is where we’ll treat automation issues more generally, discussing Cohn’s test automation pyramid (see Fowler too) (the value of extensive unit testing and below-the-UI integration testing) and Knott’s inverted pyramid. If I spend the time on this that I expect to spend, then along with describing the pyramid and the inverted pyramid and summarizing the rationales, I expect to raise the following issues:
    • For traditional applications (situations in which we expect the pyramid to apply), I think the pyramid model substantially underestimates the need for end-to-end testing. For example:
      • Performance testing is critical for many types of applications. To do this well, you need a lot of automated, end-to-end tests that model the behavior patterns of different categories of users.
      • Some bugs are almost impossible to discover with unit-level or service-level automated tests or with manual end-to-end tests or with many types of automated end-to-end tests. In my experience, wild pointers, stack overflows, race conditions and other timing problems seem to be most effectively found by high-volume automated system-level tests.
      • Security testing seems to require a combination of skilled manual attacks and routinized tests for standard vulnerabilities and high-volume automated probes.
      • As we get better at these three types of automated system-level tests, I suspect that we will develop better technology for automated functional tests.
    • For mobile apps, Knott makes a persuasive case that the pyramid has to be inverted (more system-level testing). This is because of an explosion of configuration diversity (significant variations of hardware and system software across phones) and because some tasks are location-dependent, time-dependent, or connection-dependent.
      • I think there is some value in remembering that problems like this happened in the Windows and Unix worlds too. In DOS/Windows/Apple, we had to test application compatibility with printers printer-by-printer in each application. Same for compatibility with video cards, keyboards, rodents, network interfaces, sound cards, fonts, disk formats, disk drive interfaces, etc. I’ve heard of similar problems in Unix, driven more by variation in system hardware and architecture than by peripherals. Gradually, in the Windows/Apple worlds, the operating systems expanded their reach and presented standardized interfaces to the applications. Claims of successful standardization were often highly exaggerated at first, but gradually, variances between printer models (etc.) have become much simpler, smaller testing issues. We should look at the weakly-managed device variation in mobile phones as a big testing problem today, that we will have to develop testing strategies to deal with, while recognizing that those strategies will become obsolete as the operating systems mature. I think the historical perspective is important because of the risk that the mobile testers of today will become the test managers of tomorrow. There is a risk that they will manage for the problems they matured with, even though those problems have largely been solved by better technology and better O/S design. I think I see that today, with managers of my generation, some of whom seem still worried about the problems of the 1980’s/1990’s. Today’s students need historical perspective so that they can anticipate the evolution of system risks and plan to evolve with them.
      • How does this apply to the inverted pyramid? I think that it is inverted today as a matter of necessity, but that 20 years from now, the same automation pyramid that applies to other applications will apply equally well to mobile. As the systems mature, the distortion of testing processes that we need to cope with immaturity will gradually fall away.
  • Rather than arguing about whether automation is desirable (it is) and whether it is often extremely useful (it is) and whether it is overhyped (it is) and whether simplistic strategies that won’t work are still being peddled to people not sophisticated enough to realize what trouble they’re getting into (they are), I want people to come out of Foundations with a few questions. Here’s my current list. I’m sure it will change:
    • What types of information will we learn when we use this tool in this way?
    • What types of information are we missing and how much do we care about which types?
    • What other tools (or manual methods) can we use to fill in the knowledge gaps that we’ve identified?
    • What are the costs of using this technology?
    • Every added type of information comes at a cost. Are we willing to invest in a systematic approach to discovering a type of information about the software and if so, what will the cost be of that systematic approach? If the systematic approach is not feasible (too hard or too expensive), what manual methods for discovering the information are available, how effective are they, how expensive are they and how thorough a look are we willing to pay for?
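To ground the captured-screen example mentioned above (this is a toy of my own, not Knott’s or any vendor’s code): a screenshot-comparison oracle is easy to state and automate, and its costs and blind spots are equally easy to see:

```python
def screens_match(reference, actual, per_pixel_tol=0, max_diff_fraction=0.0):
    """Toy captured-screen oracle over 2-D lists of grayscale pixels.
    What it can notice: any visible change in rendered output.
    What it is blind to: everything off-screen (state, data, timing).
    Its maintenance cost: legitimate redesigns, new fonts, and
    anti-aliasing differences all register as failures."""
    if len(reference) != len(actual) or any(
            len(r) != len(a) for r, a in zip(reference, actual)):
        return False  # resolution or layout changed; route to a human
    total = sum(len(row) for row in reference)
    diffs = sum(1 for ref_row, act_row in zip(reference, actual)
                for ref_px, act_px in zip(ref_row, act_row)
                if abs(ref_px - act_px) > per_pixel_tol)
    return diffs <= max_diff_fraction * total

# Identical screens pass; a one-pixel change fails at zero tolerance.
print(screens_match([[0, 0], [0, 0]], [[0, 0], [0, 0]]))  # True
print(screens_match([[0, 0], [0, 0]], [[0, 9], [0, 0]]))  # False
```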

Please send us good pointers to articles / blog posts that Foundations students can read to learn more about these questions (and ways to answer them). The course itself will go over this lightly. To address this in more depth would require a whole course, rather than a one-hour lecture. Students who are interested in this material (as every wise student will be) will learn most of what they learn on the topic from the readings.

A supplementary video:

The emphasis on automation raises a different cluster of issues having to do with the student’s career path. There used to be a solid career for a person who knew basic black box testing techniques and had general business knowledge. This person tested at the black box level, doing and designing exploratory and/or scripted tests. I think this will gradually die as a career. We have to note the resurgence of pieceworking, a type of employment where the worker is paid by the completed task (the “piece”) rather than by the time spent doing the task. If you can divide a task into many small pieces of comparable difficulty, you can spread them out across many workers. In general, I think pieceworking provides low incomes for people who are doing tasks that don’t require rare skills. Take a look at Soylent, “an add-in to Microsoft Word that uses crowd contributions to perform interactive document shortening, proofreading, and human-language macros.” Look over their site and read their paper. Think forward ten years. How much basic, system-level testing will be split out as piecework rather than done in-house or outsourced?

I’m not advocating for this change. I’m looking at a social context (that’s what context-driven people do) and saying that it looks like this is coming down the road, whether we like it or not.

There is already a big pay differential between traditional black box testers and testers who write code as part of their job. At Florida Tech, I see a big gap in starting salaries of our students. Students who go into traditional black box roles get much lower offers. As pieceworking gets more popular, I think the gap will grow wider. I think that students who are training today for a job as a traditional black-box tester are training for a type of job that will vanish or will stop paying a middle-class wage for North American workers. I think that introductory courses in software testing have a responsibility to caution students that they need to expand their skills if they want a satisfactory career in testing.

Some people will challenge my analysis. Certainly, I could be wrong. As Rob Lambert pointed out recently, people have been talking about the demise of generalist, black-box testers for years and years and they haven’t gone away yet. And certainly, for years to come, I believe there will be room for a few experts who thrive as consultants and a few senior testers who focus strictly on testing but are high-skill designers/planners. Those roles will exist, but how many?

Given my analysis (which could be wrong), I think it is the responsibility of teachers of introductory testing courses to caution students about the risk of working in traditional testing and the need to develop additional skills that they can market alongside their interest in testing.

  • Combining testing and programming skill is the obvious path and the one that probably opens the broadest set of doors.
  • Another path combines testing with deep business knowledge. If you know a lot about actuarial math and about the culture of actuaries, you might be able to provide very valuable services to companies that develop software for actuaries. However, your value for non-actuarial software might be limited.
  • For some types of products or services, you might add unusual value if you have expertise in human factors, or in accessibility, or in statistical process control, or in physics. Again, you offer unusual value for companies that need your combination of skills but perhaps only generic value for companies that don’t need your specific skills.
  • I don’t think that a quick, informed-amateur-level survey of popular topics in cognitive psychology will provide the kind of skills or the depth of skills that I have in mind.

It’s up to the student to figure out their career path, but it’s up to us to tell them that a career that has been solid for 50 years is changing character in ways they have to prepare for.

I’ve been trying to figure out how to focus a Lesson on this, or how to insert this into one of the other Lessons. I don’t think it will fit. There aren’t enough available hours in a 4-week online course. However, just as we gave application videos to students in the Domain Testing course (and some students watched only one video per Lesson while others watched several), I think I can create a watch-this-if-you-want-to video on career path that accompanies the lecture that addresses test automation.

3. Coverage is a multidimensional concept

My primary goal in teaching “coverage” was to open students’ eyes to the many different ways that they can measure coverage. Yes, there are structural coverage measures (statement coverage, branch coverage, subpath coverage, multicondition coverage, etc.) but there are also many black-box measures. For example, if you have a requirements specification, what percentage of the requirements items have you tested the program against?

Coverage is a measure, normally expressed as a percentage. There is a set of things that you could test: all the lines of code, or all the relevant flavors of smartphones’ operating systems, or all the visible features, or all the published claims about the program, etc. Pick your set. This is your population of possible test targets of a certain kind (target: thing you can test). The size of your set (the number of possible test targets) is your denominator. Count how many you’ve actually tested: that’s your numerator. Do the division and you get the proportion of possible tests of this kind that you have actually run. That’s your coverage measure. Statement coverage is the percentage of statements you’ve tested. Branch coverage is the percentage of branches you’ve tested. Visible-feature coverage is the percentage of visible features that you’ve tested thoroughly enough to say, “I’ve tested that one.”
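The arithmetic is simple enough to sketch (the counts below are made up for illustration):

```python
def coverage(tested, population):
    """Coverage of one kind of test target, as a percentage."""
    if population == 0:
        raise ValueError("no targets of this kind; coverage is undefined")
    return 100.0 * tested / population

# One measure per kind of target -- there is no single "the coverage":
print(coverage(tested=912, population=960))  # statement coverage: 95.0
print(coverage(tested=1, population=50))     # mouse coverage: 2.0
print(coverage(tested=44, population=44))    # visible-feature coverage: 100.0
```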

Complete coverage is 100% coverage of all of the possible tests. The program is completely tested if there cannot be any undiscovered bugs (because you’ve run all the possible tests). For nontrivial programs, complete testing is impossible because the population of possible tests is infinite. So you can’t have complete coverage. You can only have partial coverage. And at that point, we teach you to subdivide the world into types of test targets and to think in terms of many test coverages. My main paper on testing coverage lists 101 different types of coverage. There are many more. Maybe you want to achieve 95% statement coverage and 2% mouse coverage (how many different rodents do you need to test?) and 100% visible feature coverage and so on. Evaluating the relative significance of the different types of coverage gives you a way to organize and prioritize your testing.

I had generally positive results teaching this in Foundations 1.0 and 2.0, but it was clear that some students didn’t understand programming concepts well enough to understand what the structural coverage measures actually were or how they might go about collecting the data. Around 2008, I realized that many of my university students (computer science majors) had this problem too. They had studied control structures, of course, but not in a way that they could easily apply to coverage measurement. In addition, they (and my non-programmer practitioner students) had repeatedly shown confusion about how computers represent data, how rounding error comes about in floating point calculations, why rounding error is inherent in floating point calculations, etc. These confusions had several impacts. For example, some students couldn’t fathom overflows (including buffer overflows). Many students insisted that normal floating-point rounding errors were bugs. I had seen mistakes like these in real-life bug reports that damaged the credibility of the bug reporter. So, we decided that for Foundations 3.0, we would add basic information about programming and data representation to the course.
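For example (a standard demonstration, not course code), the rounding behavior that students mislabeled as a bug is ordinary IEEE-754 arithmetic: 0.1 has no exact binary representation, so small errors accumulate, and tests must compare within a tolerance:

```python
import math

total = sum([0.1] * 10)      # ten dimes "should" make a dollar
print(total == 1.0)          # False: accumulated representation error
print(f"{total:.17f}")       # 0.99999999999999989 on a typical double

# The correct check in a test compares within a tolerance:
print(math.isclose(total, 1.0, rel_tol=1e-9))  # True: not a bug
```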

Critique:

  • The programming material was very helpful for some students. For university students, it built a bridge between their programming knowledge and their testing lessons. Confusions that I had seen in Foundations 2.0 simply went away. For non-programmer professional students, the results were more mixed. Some students gained a new dimension of knowledge. Other students learned the material well enough to be tested on it but considered it irrelevant and probably never used it. And some students learned very little from this material, whined relentlessly and were irate that they had to learn about code in a course on black box testing.
  • The programming material overshadowed other material in Lessons 4 and 5. Some students paid so much attention to this material that they learned it adequately, but missed just about everything else. Some students dropped the course out of fear of the material or because they couldn’t find enough time to work through this material at their pace.
  • Many students never understood that coverage is a measurement. Instead, they would insist that “statement coverage” means all statements have been tested, and generally, that X coverage means that all X’s have been tested. I think we could have done a much better job with this if we had more time.
  • The lecture also addressed the risk of driving testing to achieve high coverage of a specific kind. I taught Marick’s How to Misuse Code Coverage, which describes the measurement dysfunction that he saw in organizations that pushed their testing to achieve very high statement/branch coverage. The students handled this material well on exams. The programming instruction probably helped with this.

Tentative decisions:

  • Return to a lecture that emphasizes the measurement aspects of coverage, the multidimensional nature of coverage, and the risks of driving testing to achieve high coverage (rather than high information).
  • Reduce the programming instruction to an absolute minimum. We need to teach a little bit about control structures in order to explain structural coverage, but we’ll do it briefly. This will create the same problems for students as we had in Foundations 1.0 and 2.0.
  • Create a series of supplementary videos that describe data representation and control structures. I’ll probably use the same lecture material in my Introduction to Programming in Java (Computer Science 1001). We’ll encourage students to watch the videos, probably offer a quiz on the material to help them learn it, maybe offer a bonus question on the exam, but won’t require the students to look at it.

4. Complete testing is impossible

This lecture teaches students that it is impossible to completely test a program. In the course of this instruction, we teach some basic combinatorics (showing students how to calculate the number of possible tests of certain kinds) and expose them again, in a new way, to the wide variety of types of test targets (things you can test) and the difficulty of testing all of them. We teach the basics of data flows, the basics of path testing, and provide a real-life example of the kinds of bugs you can miss if you don’t do extensive path testing.
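A back-of-the-envelope version of the combinatorics (my numbers, not the lecture’s): even a trivially small input space defeats complete testing:

```python
# One function that takes two 32-bit integers has this many input pairs:
pairs = (2**32) ** 2
print(pairs)  # 18446744073709551616, about 1.8e19

# At a billion tests per second, the valid input pairs alone would take:
years = pairs / 1e9 / (3600 * 24 * 365)
print(round(years))  # ~585 years -- ignoring invalid inputs, sequences,
                     # timing, and configuration, which multiply this further
```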

Critique:

  • Overall, I think this material was OK in Foundations 3.0.
  • The lecture introduces a strong example of a real-life application of high-volume automated testing but provides only the briefest introduction to the concept of high-volume automated testing. The study guide (list of potential exam questions) included questions on high-volume automated testing but we were reluctant to ask any of them because we treated the material too lightly.

Tentative decisions:

  • Cover essentially the same material.
  • Integrate more tightly with the preceding material on the many types of coverage.
  • Add a little more on high-volume automated testing.

5. Measurement is important, but difficult

The lecture introduces students to the basics of measurement theory, to the critical importance of construct validity, to the prevalent use of surrogate measures and the difficulties associated with this, and to measurement dysfunction and distortion. We illustrate measurement dysfunction with bug-count metrics.

Critique:

  • Overall, the section was reasonably effective in conveying what it was designed to convey, but that design was flawed.
  • The lecture exuded mistrust of bad metrics. It provided no positive inspiration for doing good measurement and very little constructive guidance. Back in 2001, Pettichord and Bach and I wrote, “Metrics that are not valid are dangerous.” We inscribed the words in Lessons Learned in Software Testing and in the statement of the Context-Driven Testing Principles.
    • Back in the 1990’s (and still in 2001), we didn’t have much respect for the metrics work that had been done and we thought that bright people could easily make a lot of progress toward developing valid, useful metrics. We were wrong. The next dozen years taught me some humility. The context-driven testing community (and the broader software engineering community) made remarkably little progress on improving software-related measurement. Some of us (including me) have made some unsuccessful efforts. I don’t see any hint of a breakthrough on the horizon. Rather, I see more consultants who simply discourage clients and students from taking measurements or who present oversimplified descriptions of qualitative measurement that don’t convey the difficulties and the cost of doing qualitative measurement well. This stuff plays well at conferences but it’s not for me.
      • I am against bad measurement (and a lot of terrible measurement is being advocated in our field, and students need some help recognizing it). But I am not against measurement. Back in the 1990s and early 2000s, I discouraged people from taking measurements, because I thought a better set of metrics was around the corner. It’s not. Eventually, my very wise friend Hung Quoc Nguyen counseled me that it was time for me to rethink my position. He argued that the need for data to guide management is still with us — that if we don’t have better measures available, we need to find better ways to work with what we’ve got. I have come to agree with his analysis.
      • Many people in our field are required to provide metrics to their management. We can help them improve what they provide. We can help them recognize problems and suggest better alternatives. But we aren’t helping them if we say that providing metrics is unethical and that they need to refuse to do it, even if that costs them their jobs. I wrote about this in Contexts differ: Recognizing the difference between wrong and Wrong and then in Metrics, Ethics, & Context-Driven Testing (Part 2).
      • Becky Fiedler and I have used qualitative approaches several times. We’re enthusiastic about them. See our paper on Putting the Context in Context-Driven Testing, for example. But we have no interest in misrepresenting the time and skill required for this kind of work or the limitations of it. This is a useful set of tools, but it is not a collection of silver bullets.

Tentative decisions:

  • We have to fundamentally rework this material. I will probably start from two places:
  • I think these provide a better starting point for a presentation on metrics that:
    • Introduces basic measurement theory
    • Alerts people to the risks associated with surrogate measures, weak validity, and abusive use of metrics
    • Explains measurement distortion and dysfunction
    • Introduces basic qualitative measurement concepts
    • Lays out the difficulties of doing good measurement, peeks at the state of metrics in some other fields (not better than ours), and presents a more positive view on ways to gain useful information from imperfect statistics.
  • The big challenge here, the enormous challenge, will be fitting this much content into a reasonably short lecture.

Racial Profiling in Ferguson Missouri? A Note on Statistical Interpretation

Thursday, August 21st, 2014

As I write this, there is serious community tension in Ferguson, Missouri over the shooting and killing of an unarmed black teenager by a white police officer, and over the response to that shooting by the local police force.

The narrative in much of the press is that this is yet another incident that illustrates a serious problem of racism in the United States, especially in our police. For example, many newspapers cite data from the Missouri 2013 Vehicle Stops Report. The overall report covers data in every county in Missouri and presents years of historical comparisons. There is also a separate report specifically for Ferguson.

Here are some examples of press coverage that seem representative of what I’ve seen in many papers online:

“Last year, 86 percent of the cars stopped by Ferguson police officers were being driven by African-Americans, according to the state’s annual racial profiling report. Once pulled over in Ferguson, African-American drivers were twice as likely to be searched, according to the report.”

http://www.mcclatchydc.com/2014/08/19/237001_feds-could-go-several-ways-in.html?rh=1#storylink=cpy

“Last year, for the 11th time in the 14 years that data has been collected, the disparity index that measures potential racial profiling by law enforcement in the state got worse. Black Missourians were 66 percent more likely in 2013 to be stopped by police, and blacks and Hispanics were both more likely to be searched, even though the likelihood of finding contraband was higher among whites.”

http://www.stltoday.com/news/opinion/columns/the-platform/editorial-michael-brown-and-disparity-of-due-process/article_40bb2d0e-8619-534a-b629-093ebc79f0a6.html

Of course, there are some counter-examples. Some news reports (what I’ve seen on Fox, for example) seem to ignore these data completely and instead appear to me to present the events in terms of violent bad black people who deserve whatever violent treatment the police provide for them. There is nothing useful to learn about data evaluation from these reports, so I will ignore them for the rest of this note.

I should state my bias: My personal (nonexpert) impression is that the shooting was unjustified and that the St. Louis County police response has been inappropriate. I have no insight into the motivation of anyone involved.

However, if you look at the actual numbers from Ferguson, it is not clear to me that conclusions of racial profiling (conclusions like the ones quoted above, which have appeared in every news source that I respect) are justified by the data.

The focus of this blog is on the teaching of software engineering topics, primarily software testing and measurement (and thus too, statistical analysis).

The data from Ferguson provide an interesting example for caution in the interpretation of such data.

First, some of the numbers that are consistent with the summaries. According to the Attorney General’s report:

  • Ferguson’s population (age 16 and over) is 15,865, of whom 63% are black.
  • 4632 of 5384 vehicle stops (86%) were of blacks, a much higher percentage than the 63% of the population.
  • 562 of the 611 searches (92%) were of blacks.
  • 483 of the 521 arrests (93%) were of blacks.
  • 12.13% of the blacks who were stopped were searched, compared to only 6.85% of the whites.
  • 21.71% of the blacks who were searched had contraband (drugs, weapons, stolen property) compared to 34.04% of the whites.

These data appear to suggest two conclusions:

  1. Blacks are being stopped, searched and arrested at a higher rate than their representation in the population
  2. Many more searches of blacks than whites are unproductive, suggesting the police would find more contraband if they searched fewer blacks and more whites.

If Ferguson’s police are continuing to search blacks at a much higher frequency than whites, even though searched whites have contraband at a higher frequency than searched blacks, this appears to suggest a pattern that is racist and counterproductive (less protective of public safety).

That conclusion, I think, is the conclusion the newspapers are inviting us to draw.

Let’s look at some more data.

  • 66% (369/562) of the searches and 76% (369/483) of the arrests of blacks involved an outstanding warrant
  • 30% (14/47) of the searches and 39% (14/36) of the arrests of whites involved an outstanding warrant

Searches and arrests involving warrants don’t involve much exercising of judgment on the part of the officer who is searching or arresting someone.

  • A warrant is an order from a court to arrest someone. The officer is supposed to stop and arrest a person if there is a warrant out for them.
  • When a police officer arrests someone, they must search the person. Among the many important reasons for this rule is the safety of the officer: arresting someone and then not checking them carefully for weapons would be extremely unwise.

In a community of only 15,865 people, it would not be surprising for the local police to be aware of most of the people who have warrants outstanding against them or for these police to recognize those people on the street.

Because the police are supposed to arrest people who have warrants against them and supposed to search people they arrest, I don’t think we should count these numbers of stops, searches and arrests against the police.

If you look only at the stops that didn’t involve outstanding warrants,

  • In 34% of the cases in which police searched a black person, and 24% of the cases in which they arrested a black person, there was no outstanding warrant involved.

In contrast

  • In 70% of the cases in which police searched a white person, and 61% of the cases in which they arrested a white person, there was no outstanding warrant involved.
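These percentages follow directly from the counts quoted earlier (562 searches and 483 arrests of blacks, 369 involving a warrant; 47 searches and 36 arrests of whites, 14 involving a warrant). A quick check of the arithmetic:

```python
def pct_no_warrant(total, with_warrant):
    """Percentage of searches or arrests that did not involve a warrant."""
    return 100.0 * (total - with_warrant) / total

print(round(pct_no_warrant(562, 369)))  # 34 -- searches of black people
print(round(pct_no_warrant(483, 369)))  # 24 -- arrests of black people
print(round(pct_no_warrant(47, 14)))    # 70 -- searches of white people
print(round(pct_no_warrant(36, 14)))    # 61 -- arrests of white people
```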

The conclusions that these numbers suggest to me are that:

  • the Ferguson police appear to have been making discretionary stops (stops in which they were exercising their own judgment, rather than executing a court order) of white people at almost twice the rate as for black people
  • the higher contraband-find rate for whites than blacks might be because a higher proportion of whites were searched on the basis of police suspicion of contraband, compared to a higher proportion of blacks being searched as part of an arrest that involved a warrant (past bad behavior, not currently suspicious behavior). Considered this way, the disparity (higher rate of contraband-finds for whites versus blacks) seems unsurprising and not at all suggestive of bad police work.

I don’t know what truth underlies these numbers. I think that, for me to interpret them with any confidence, I would have to do other studies, such as riding along with Ferguson police and learning how they decide who to stop and what post-stop behaviors trigger further investigation (such as searches or checks for outstanding warrants).

What does seem clear to me is that the first conclusion (racially-motivated differences in the police officers’ decisions to search people) is not supported by these data. That motivation might be present but—despite first appearances—these data do not seem to be evidence of it.

I wrote this note because it suggests two important lessons for students of statistics and research design:

  1. In many cases (as here), data may show statistically significant differences (differences large and consistent enough that chance is an implausible explanation). However, interpretation of those differences is almost always open to further investigation.
    • The numbers don’t tell you what they mean. Even the most statistically significant trends must be interpreted by people.
  2. In many cases (as here), the data support alternative interpretations.
    • Whenever possible, you should look at your data in many ways, to see if they tell you the same story. If they don’t, you need to investigate further, and maybe fix your model.
    • A few numbers in isolation tell you very little, often much less than you would initially imagine.
    • If you design your research (or your management) so that you will see only a few numbers at the end, you are designing tunnel vision into your work. You are creating your own context for bad interpretations and bad decisions.

On the Quality of Qualitative Measures

Monday, April 28th, 2014


Cem Kaner, J.D., Ph.D. & Rebecca L. Fiedler, M.B.A., Ph.D.

This is an informal first draft of an article that will summarize some of the common guidance on the quality of qualitative measures.

  • The immediate application of this article is to Kaner’s courses on software metrics and software requirements analysis. Students would be well-advised to read this summary of the lectures carefully (yes, this stuff is probably on the exam).
  • The broader application is to the increasingly large group of software development practitioners who are considering using qualitative measures as a replacement for many of the traditional software metrics. For example, we see a lot of attention to qualitative methods in the agenda of the 2014 Conference of the Association for Software Testing. We won’t be able to make it to CAST this year, but perhaps these notes will provide some additional considerations for their discussions.

On Measurement

Managers have common and legitimate informational needs that skilled measurement can help with. They need information in order to (for example…)

  • Compare staff
  • Compare project teams
  • Calculate actual costs
  • Compare costs across projects or teams
  • Estimate future costs
  • Assess and compare quality across projects and teams
  • Compare processes
  • Identify patterns across projects and trends over time

Executives need these types of information, whether we know how to provide them or not.

Unfortunately, there are strong reasons to be concerned about the use of traditional metrics to answer questions like these. These are human performance measures. As such, they must be used with care or they will cause dysfunction (Austin, 1996). That has been a serious real-life problem (e.g. Hoffman 2000). The empirical basis supporting several of them has been substantially exaggerated (Bossavit, 2014). Many of the managers who use them know so little about mathematics that they don’t understand what their measurements mean and their primary uses are to placate management or intimidate staff. Many of the consultants who give talks and courses advocating metrics also seem to know little about mathematics or about measurement theory. They seem unable to distinguish strong questions from weak ones, unable to discuss the underlying validity of the measures they advocate, and so they seem reliant on appeals to authority, on the intimidating quality of published equations, and on the dismissal of the critic as a nitpicker or an apologist for undisciplined practices.

In sum, there are problems with the application of traditional metrics in our field. It is no surprise that people are looking for alternatives.

In a history of psychological research methods, Kurt Danziger (1994) discusses the distorting impact of quantification on psychological measurement. (See especially his Chapter 9, From quantification to methodolatry.) Researchers designed experiments that looked more narrowly at human behavior, ignoring (designing out of the research) those aspects of behavior or experience that they could not readily quantify and interpret in terms of statistical models.

“All quantitative data is based upon qualitative judgments.”
(Trochim, 2006 at http://www.socialresearchmethods.net/kb/datatype.php)

Qualitative methods might sometimes provide a richer description of a project or product that is less misleading, easier to understand, and more effective as a source of insight. However, there are problems with the application of qualitative approaches.

  • Qualitative reports are, at their core, subjective.
  • They are subject to bias at every level (how the data are gathered or selected, stored, analyzed, interpreted and reported). This is a challenge for every qualitative researcher, but it is especially significant in the hands of an untrained researcher.
  • They are based on selected data.
  • They aren’t very helpful for making comparisons or for providing quantitative estimates (like, how much will this cost?).
  • They are every bit as open to abuse as quantitative methods.
  • And it costs a lot of effort to do qualitative measurement well.

We are fans of measurement (qualitative or quantitative) when it is done well and we are unenthusiastic about measurement (qualitative or quantitative) when it is done badly or sold overenthusiastically to people who aren’t likely to understand what they’re buying.

Because this paper won’t propagandize qualitative measurement as the unquestioned embodiment of sweetness and light, some readers might misunderstand where we are coming from. So here is a little about our background.

  • As an undergraduate, Kaner studied mainly mathematics and philosophy. He also took two semesters of coursework with Kurt Danziger. We only recently read Danziger (1994) and realized how profoundly Danziger has influenced Kaner’s professional development and perspective. As a doctoral student in experimental psychology, Kaner did some essentially-qualitative research (Kaner et al., 1978) but most of his work was intensely statistical, applying measurement theory to human perception and performance (e.g. Kaner, 1983). He applied qualitative methods to client problems as a consultant in the 1990’s. His main stream of published critiques of traditional quantitative approaches started in 1999 (Kaner, 1999a, 1999b). He wrote explicitly about the field’s need to use qualitative measures in 2002. He started giving talks titled “Software Testing as a Social Science” in 2004, explicitly identifying most software engineering measures as human performance measures subject to the same types of challenges as we see in applied measurement in psychology and in organizational management.
  • Fiedler’s (2006, 2007) dissertation used Cultural-Historical Activity Theory (CHAT), a framework for analyzing and organizing qualitative investigations, to examine portfolio management software in universities. Kaner & Fiedler started applying CHAT to scenario test design in 2007. We presented a qualitative-methods tutorial and a long paper with detailed pointers to the literature at CAST in 2009 and at STPCon in Spring 2013. We continue to use and teach these ideas and have been working for years on a book relating qualitative methods to the design of scenario tests.

We aren’t new to qualitative methods. This is not a shiny new fad for us. We are enthusiastic about increasing the visibility and use of these methods but we are keenly aware of the risk of over-promoting a new concept to the mainstream in ways that dilute the hard parts until all that remains are buzzwords and rituals. (For us, the analogies are Total Quality Management, Six Sigma, and Agile Development.)

Perhaps some notes on what makes qualitative measures “good” (and what doesn’t) might help slow that tide.

No, This is Not Qualitative

Maybe you have heard a recommendation to make project status reporting more qualitative. To do this, you create a dashboard with labels and faces. The labels identify an area or issue of concern, such as how buggy the software is. And instead of numbers, use colored faces because this is more meaningful. A red frowny-face says, There is trouble here. A yellow neutral-face says, Things seem OK, nothing particularly good or bad to report now. And a green smiley-face says, Things go well. You could add more differentiation by having a brighter red with a crying-face or a screaming-or-cursing-face and by having a brighter green with a happy-laughing face.

See, there are no numbers on this dashboard, so it is not quantitative, right?

Wrong.

The faces are ordered from bad to good. You can easily assign numerals to these (1 for red-screaming-face through 5 for green-laughing-face), you can talk about the “average” (median) score across all the categories of information, and you can even draw graphs of the change of confidence (or whatever you map to happyfacedness) from week to week across the project.
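To make that concrete (an illustrative sketch; the category labels and scores are invented):

```python
# The "no numbers" dashboard is an ordinal scale wearing costumes.
FACE_SCALE = {"red-screaming": 1, "red-frowny": 2, "yellow-neutral": 3,
              "green-smiley": 4, "green-laughing": 5}

week_report = {"bugginess": "red-frowny", "schedule": "yellow-neutral",
               "docs": "green-smiley"}

scores = sorted(FACE_SCALE[face] for face in week_report.values())
median = scores[len(scores) // 2]
print(median)  # 3 -- a single "average" from a supposedly non-numeric report
```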

This might not be very good quantitative measurement but as qualitative measurement it is even worse. It uses numbers (symbols that are equivalent to 1 to 5) to show status without emphasizing the rich detail that should be available to explain and interpret the situation.

When you watch a consultant present this as qualitative reporting, send him away. Tell him not to come back until he actually learns something about qualitative measures.

OK, So What is Qualitative?

A qualitative description of a product or process is a detail-rich, multidimensional story (or collection of stories) about it. (Creswell, 2012; Denzin & Lincoln 2011; Patton, 2001).

For example, if you are describing the value of a product, you might present examples of cases in which it has been valuable to someone. The example wouldn’t simply say, “She found it valuable.” The example would include a description of what made it valuable, perhaps how the person used it, what she replaced with it, what made this one better than the last one, and what she actually accomplished with it. Other examples might cover different uses. Some examples might be of cases in which the product was not useful, with details about that. Taken together, the examples create an overall impression of a pattern – not just the bias of the data collector spinning the tale he or she wants to tell. For example, the pattern might be that most people who try to do THIS with the product are successful and happy with it, but most people who try to do THAT with it are not, and many people who try to use this tool after experience with this other one are likely to be confused in the following ways …

When you describe qualitatively, you are describing your perceptions, your conclusions, and your analysis. You back it up with examples that you choose, quotes that you choose, and data that you choose. Your work should be meticulously and systematically even-handed. This work is very time-consuming.

Quantitative work is typically easier and less ambiguous, requires less-detailed knowledge of the product or project as a whole, and is therefore faster.

If you think your qualitative measurement methods are easier, faster and cheaper than the quantitative alternatives, you are probably not doing the qualitative work very well.

Quality of Qualitative

In quantitative measurement, questions about the value of a measure boil down to questions of validity and reliability.

A measurement is valid to the extent that it provides a trustworthy description of the attribute being measured. (Shadish, Cook & Campbell, 2001)

A measurement is reliable to the extent that repeating the same operations (measuring the same thing in the same ways) yields the same (or similar) results.

In qualitative work, the closest concept corresponding to validity is credibility (Guba & Lincoln, 1989). The essential question about the credibility of a report of yours is, Why should someone else trust your work? Here are examples of some of the types of considerations associated with credibility.

Examples of Credibility-Related Considerations

The first and most obvious consideration is whether you have the background (knowledge and skill) to be able to collect, interpret and explain this type of data.

Beyond that, several issues come up frequently in published discussions of credibility. (Our presentation is based primarily on Agostinho, 2005; Creswell, 2012; Erlandson et al., 1993; Finlay, 2006; and Guba & Lincoln, 1989.)

  • Did you collect the data in a reasonable way?
    • How much detail? Students of ours work with qualitative document analysis tools, such as ATLAS.ti, Dedoose, and NVivo. These tools let you store large collections of documents (such as articles, slides, and interview transcripts), pictures, web pages, and videos (https://en.wikipedia.org/wiki/Computer_Assisted_Qualitative_Data_Analysis_Software). We are now teaching scenario testers to use the same types of tools. If you haven’t worked with one of these, imagine a concept-mapping tool that allows you to save all the relevant documents as sub-documents in the same document as the map and allows you to show the relationships among them not just with a two-dimensional concept map but with a multidimensional network, a set of linkages from any place in any document to any place in any other document.

    As you see relevant information in a source item, you can code it. Coding means applying meaningful tags to the item, so that you can see later what you were thinking now. For example, you might code parts of several documents as illustrating high or low productivity on a project. You can also add comments to these examples, explaining for later review what you think is noteworthy about them. You might also add a separate memo that describes your ideas about what factors are involved in productivity on this project, and another memo that discusses a different issue, such as notes on individual differences in productivity that seem to be confounding your evaluation of tool-caused differences. Later, you can review the materials by looking at all the notes you’ve made on productivity—all the annotated sources and all your comments.

    You have to find (or create) the source materials. For example, you might include all the specification-related documents associated with a product, all the test documentation, user manuals from each of your competitors, all of the bug reports on your product and whatever customer reports you can capture for other products, interviews with current users, including interviews with extremely satisfied users, users who abandoned the product and users who still work with the product but hate it. Toss in status reports, comments in the source code repository, emails, marketing blurbs, and screen shots. All these types of things are source materials for a qualitative project.

    You have to read and code the material. Often, you read and code with limited or unsophisticated understanding at first. Your analytical process (and your ongoing experience with the product) gives you more insight, which causes you to reread and recode material. The researcher typically works through this type of material in several passes, revising the coding structure and adding new types of comments (Patton, 2001). New information and insights can cause you to revise your analysis and change your conclusions.

    The final report gives a detailed summary of the results of this analysis.

    • Prolonged engagement: Did you spend enough time at the site of inquiry to learn the culture, to “overcome the effects of misinformation, distortion, or presented ‘fronts’, to establish rapport and build the trust necessary to overcome constructions, and to facilitate immersing oneself in and understanding the context’s culture”?
    • Persistent observation: Did you observe enough to focus on the key elements and to add depth? The distinction between prolonged engagement and persistent observation is the difference between having enough time to make the observations and using that time well.
    • Triangulation and convergence: “Triangulation leads to credibility by using different or multiple sources of data (time, space, person), methods (observations, interviews, videotapes, photographs, documents), investigators (single or multiple), or theory (single versus multiple perspectives of analysis)” (Erlandson et al., 1993, pp. 137-138). “The degree of convergence attained through triangulation suggests a standard for evaluating naturalistic studies. In other words, the greater the convergence attained through the triangulation of multiple data sources, methods, investigators, or theories, the greater the confidence in the observed findings. The convergence attained in this manner, however, never results in data reduction but in an expansion of meaning through overlapping, compatible constructions emanating from different vantage points.” (Erlandson et al., 1993, p. 139).
  • Are you summarizing the data fairly?
  • How are you managing your biases (people are often not conscious of the effects of their biases) as you select and organize your observations?
  • Are you prone to wishful thinking or to trying to please (or displease) people in power?
    • Peer debriefing: Did you discuss your ideas with one or more disinterested peers who gave constructively critical feedback and questioned your ideas, methods, motivation, and conclusions?
    • Disconfirming case analysis: Did you look for counter-examples? Did you revise your working hypotheses in light of experiences that were inconsistent with them?
    • Progressive subjectivity: As you observed situations or created and looked for data to assess models, how much did you pay attention to your own expectations? How much did you consider the expectations and observations of others? An observer who affords too much privilege to his or her own ideas is not paying attention.
    • Member checks: If you observed / measured / evaluated others, how much did you involve them in the process? How much influence did they have over the structures you would use to interpret the data (what you saw or heard or read) that you got from them? Do they believe you accurately and honestly represented their views and their experiences? Did you ask?
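To make the coding idea from the data-collection discussion above concrete, here is a minimal sketch in Python. The document names, excerpts, codes, and comments are all invented for illustration; real tools such as ATLAS.ti or NVivo add multi-pass recoding, memos, and cross-document linking on top of this basic structure.

    from dataclasses import dataclass

    @dataclass
    class CodedSegment:
        source: str          # which document the passage came from
        excerpt: str         # the passage being coded
        codes: list[str]     # meaningful tags applied to the passage
        comment: str = ""    # why you thought it was noteworthy

    # A tiny corpus of coded material (contents invented).
    segments = [
        CodedSegment("status-report-03.txt",
                     "Team shipped the importer two weeks early.",
                     ["productivity:high"],
                     "Possible tool effect, but the team is also senior."),
        CodedSegment("interview-dev-2.txt",
                     "I spend half my day fighting the build system.",
                     ["productivity:low", "tooling"],
                     "Individual differences may be confounding this."),
    ]

    def review(code_prefix: str) -> list[CodedSegment]:
        """The 'look at all the notes you've made on productivity' step:
        pull back every annotated segment whose codes match a prefix."""
        return [s for s in segments
                if any(c.startswith(code_prefix) for c in s.codes)]

    for seg in review("productivity"):
        print(seg.source, "->", seg.codes, "|", seg.comment)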

Transferability

The concerns that underlie transferability are much the same as for external validity (or generalization validity) of traditional metrics:

  • If you or someone else did a comparable study in a different setting, how likely is it that you would make the same observations (see similar situations, tradeoffs, examples, etc.)?
  • How well would your conclusions apply in a different setting?

When you evaluate a research report, thorough description is often a key element. The reader doesn’t know what will happen when someone tries a similar study in the future, so they (and you) probably cannot authoritatively predict generalizability (whether people in other settings will see the same things). However, if you describe what you saw well enough, in enough detail and with enough attention to the context, then when someone does perform a potentially-comparable study somewhere else, they will probably be able to recognize whether they are seeing things similar to what you were seeing.

Over time, a sense of how general something is can build as multiple similar observations are recorded in different settings.

Dependability

The concerns that underlie dependability are similar to those for internal validity of traditional metrics. The core question is whether your work is methodologically sound.

Qualitative work is more exploratory than quantitative (at least, more exploratory than quantitative work is traditionally described). You change what you do as you learn more or as you develop new questions. Therefore consistency of methodology is not an ultimate criterion in qualitative work, as it is for some quantitative work.

However, a reviewer can still ask how well (methodologically) you do your work. For example:

    • Do you have the necessary skills and are you applying them?
    • If you lack skills, are you getting help?
    • Do you keep track of what you’re doing and make your methodological changes deliberately and thoughtfully? Do you use a rigorous and systematic approach?

Many of the same ideas that we mentioned under credibility apply here too, such as prolonged engagement, persistent observation, effort to triangulate, disconfirming case analysis, peer debriefing, member checks and progressive subjectivity. These all describe how you do your work.

  • As issues of credibility, we are asking whether you and your work are worth paying attention to. Your attention to methodology and fairness reflects on your character and trustworthiness.
  • As issues of methodology, we are asking more about your skill than about your heart.

Confirmability

Confirmability is as close to reliability as qualitative methods get, but the qualitative approach does not rest as firmly on reliability. The quantitative measurement model is mechanistic. It assumes that under reasonably similar conditions, the same acts will yield the same results. Qualitative researchers are more willing to accept the idea that, given what they know (and don’t know) about the dynamics of what they are studying, under seemingly-similar circumstances, the same things might not happen next time.

We assess reliability by taking repeated measurements (do similar things and see what happens). We might assess confirmability as the ability to be confirmed rather than whether the observations were actually confirmed. From that perspective, if someone else works through your data:

  • Would they see the same things as you?
  • Would they generally agree that things you see as representative are representative and things that you see as idiosyncratic are idiosyncratic?
  • Would they be able to follow your analysis, find your records, understand your ways of classifying things and agree that you applied what you said you applied?
  • Does your report give your reader enough raw data for them to get a feeling for the confirmability of your work?

In Sum

Qualitative measurements tell a story (or a bunch of stories). The skilled qualitative researcher relies on transparency in methods and data to tell persuasive stories. Telling stories that can stand up to scrutiny over time takes enormous work. This work can have great value, but to do it, you have to find time, gain skill, and master some enabling technology. Shortchanging any of these areas can put your credibility at risk as decision-makers rely on your stories to make important decisions.

References

S. Agostinho (2005, March). “Naturalistic inquiry in e-learning research”, International Journal of Qualitative Methods, 4(1).

R. D. Austin (1996). Measuring and Managing Performance in Organizations. Dorset House.

L. Bossavit (2014). The Leprechauns of Software Engineering: How folklore turns into fact and what to do about it. Leanpub.

J. Creswell (2012, 3rd ed.). Qualitative Inquiry and Research Design: Choosing Among Five Approaches. Sage Publications.

K. Danziger (1994). Constructing the Subject: Historical Origins of Psychological Research. Cambridge University Press.

N.K. Denzin & Y.S. Lincoln (2011, 4th ed.) The SAGE Handbook of Qualitative Research. Sage Publications.

D.A. Erlandson, E.L. Harris, B.L. Skipper & S.D. Allen (1993). Doing Naturalistic Inquiry: A Guide to Methods. Sage Publications.

R. L. Fiedler (2006). “In transition”: An activity theoretical analysis examining electronic portfolio tools’ mediation of the preservice teacher’s authoring experience. Unpublished Ph.D. dissertation, University of Central Florida (Publication No. AAT 3212505).

R.L. Fiedler (2007). “Portfolio authorship as a networked activity”. Paper presented at the Society for Information Technology and Teacher Education.

R. L. Fiedler & C. Kaner (2009). “Putting the context in context-driven testing (an application of Cultural Historical Activity Theory)“. Conference of the Association for Software Testing. Colorado Springs, CO.

L. Finlay (2006). “‘Rigour’, ‘Ethical Integrity’ or ‘Artistry’? Reflexively reviewing criteria for evaluating qualitative research.” British Journal of Occupational Therapy, 69(7), 319-326.

E.G. Guba & Y.S. Lincoln (1989). Fourth Generation Evaluation. Sage Publications.

D. Hoffman (2000). “The darker side of metrics,” presented at Pacific Northwest Software Quality Conference, Portland, OR.

C. Kaner (1983). Auditory and visual synchronization performance over long and short intervals, Doctoral Dissertation: McMaster University.

C. Kaner (1999a). “Don’t use bug counts to measure testers.” Software Testing & Quality Engineering, May/June, 1999, p. 80.

C. Kaner (1999b). “Yes, but what are we measuring?” (Invited address) Pacific Northwest Software Quality Conference, Portland, OR.

C. Kaner (2002). “Measuring the effectiveness of software testers.” 15th International Software Quality Conference (Quality Week), San Francisco, CA.

C. Kaner (2004). “Software testing as a social science.” IFIP Working Group 10.4 meeting on Software Dependability, Siena, Italy.

C. Kaner & R.L. Fiedler (2013). “Qualitative Methods for Test Design”. Software Test Professionals Conference (STPCon), San Diego, CA.

C. Kaner, B. Osborne, H. Anchel, M. Hammer & A.H. Black (1978). “How do fornix-fimbria lesions affect one-way active avoidance behavior?” 86th Annual Convention of the American Psychological Association, Toronto, Canada.

M.Q. Patton (2001, 3rd ed.). Qualitative Research & Evaluation Methods. Sage Publications.

W.R. Shadish, T.D. Cook & D.T. Campbell (2001, 2nd ed.). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Cengage.

W.M.K. Trochim (2006). Research Methods Knowledge Base.

Wikipedia (2014). https://en.wikipedia.org/wiki/Computer_Assisted_Qualitative_Data_Analysis_Software

 

Presentation on software metrics

Tuesday, March 26th, 2013

Most software metrics are terrible. They are as overhyped as they are poorly researched.

But I think it’s part of the story of humanity that we’ve always worked with imperfect tools and always will. We succeed by learning the strengths, weaknesses and risks of our tools, improving them when we can, and mitigating their risks.

So how do we deal with this in a way that manages risk while getting useful information to people we care about?

I don’t think there are easy answers to this, but I think several people in the field are grappling with this in a constructive way. I’ve seen several intriguing conference-talk descriptions in the last few months and hope to post comments that praise some of them later. But for now, here’s my latest set of notes on this: https://13j276.p3cdn1.secureserver.net/pdfs/PracticalApproachToSoftwareMetrics.pdf

 

Interactive Grading in University and Practitioner Classes: An Experience Report

Monday, January 14th, 2013

Summary: Graders typically work in private, with no opportunity to ask even the simplest questions about a student’s submitted work. Interactive grading is a technique that requires the student to participate in the grading of their work. The aim of this post is to share my experiences with interactive grading, with some tips for others who want to try it. I start with an overview and then provide three detailed examples, with suggestions for the conduct of the session. Let me stress one point: care must be exercised to keep the student comfortable and engaged, and not let the session degenerate into another lecture. In terms of results, most students told me that interactive grading was a good use of their time and that it helped them improve their performance. Several asked me to add more of it to my courses. A few viewed it as more indoctrination. In terms of impact on student grades, the improvement was marginal.

Interactive grading is not a new idea. I think most instructors have done this for some students at some times. Certainly, I have. So when Professor Keith Gallagher described it to me as an important part of his teaching, I didn’t initially understand its significance. I decided to try it myself after several of Keith’s students talked favorably about their classes with him. It wasn’t until I tried it that I realized what I had been missing.

That start came 15 months ago. Since then, I’ve used interactive grading in the online (professional-development) BBST classes, hybrid (online + face-to-face) university software testing classes, and a face-to-face university software metrics course. I’ve used it for exams, essays, and assignments (take-home complex tasks). Overall, it’s been a positive change.

These notes describe my personal experiences and reflections as an instructor. I emphasize this by writing in the very-obvious first person. Your experiences might be different.

What is Interactive Grading?

When I teach, I assign tasks and students submit their work to me for grading.

Usually I review student work in private and give them feedback after I have completed my review.

  • When I do interactive grading:
    • I meet with the student before I review the work.
    • I read the work for the first time while I meet with the student.
    • I ask the student questions, often open-ended questions that help me understand what the student was trying to say or achieve. Students often demonstrate that they understood the material better than their submitted work suggests. If they misunderstood part of the task, we can get to the bottom of the misunderstanding and they can try to demonstrate, during this meeting, their ability to do the task that was actually assigned.
    • I often coach the student, offering suggestions to improve the student’s strategy or demonstrating how to do parts of the task.
    • I typically show the student a grading rubric early in the meeting and assign the grade at the end of the meeting.
  • When I explicitly build interactive grading into my class:
    • It becomes part of the normal process, rather than an exception for a student who needs special attention. This changes the nature and tone of the discussions.
    • Every student knows well in advance what work will be interactively graded and that every student’s work will be handled the same way. This changes how they prepare for the meetings and how they interpret the meetings.
    • I can plan later parts of the course around the fact that the students have already had this experience. This changes my course design.

Costs and Benefits of Interactive Grading

Here is a summary of my conclusions. I’ll support this summary later in this report, with more detailed descriptions of what we did and what happened.

Costs

Interactive grading feels like it takes more time:

  • It takes time to prepare a grading structure that the students can understand (and therefore that I can use effectively when we have the meeting).
  • Scheduling can take a lot of time.
  • The meetings sometimes run long. It feels as though it takes longer to have the meeting than it would take to grade the work normally.

When I’ve checked my grading times for exams and assignments that I do in the traditional way, I think I actually spend the same amount of time (or more). I also do the same level of preparation. (Note: I do a lot of pre-grading preparation. The contrast might be greater for a less formal grader.)

As far as I can tell, the actual difference for me is not time, it is that interactive grading meetings are more stressful for me than grading in a quiet, comfortable home office. That makes it feel longer.

Benefits for the Students

During interactive grading, I can ask questions like,

  • What were you thinking?
  • What do you think this word in the question means? If I gave you a different explanation of what this word means, how would that affect your answer to the question?
  • Give me an example of what you are describing.
  • Can you give me a real-life example of what you are describing? For example, suppose we were working with OpenOffice. How would this come up in that project?
  • Can you explain this with a diagram? Show me on my whiteboard.
  • How would you answer this if I changed the question’s wording this way?
  • How would someone actually do that?
  • Why would anyone want to do that task that way? Isn’t there a simpler way to do the same thing?

I will raise the grade for a student who does a good job with these questions. I might say to the student,

“If I was only grading the written answer, you would get a ‘D’. But with your explanation, I am giving you a ‘B’. We need to talk about how you can present what you know better, so that you can get a ‘B’ again on the next exam, when I grade your written answers without an interactive supplement.”

If a student performs poorly on an exam (or an assignment, or an essay), the problem might be weak competence or weak performance.

  • A student who doesn’t know the material has a competence problem.
  • A student who knows the material but nevertheless gives a poor answer on an exam has a performance problem. For example, you won’t get a good answer from a knowledgeable student who writes poorly or in a disorganized way or who misunderstands the question.

These focusing questions:

  • give the student who knows the material a chance to give a much better explanation or a much better defense of their answer.
  • give the student who knows the material but performs poorly some indicators of the types of performance improvements they need to work on.
  • give the student who doesn’t know the material clear feedback on why they are getting a poor grade.

Let’s consider competence problems. Students might lack the knowledge or the skills they are supposed to be learning for several reasons:

  • Some students simply don’t take the time or make the effort to do good work. Interactive grading probably won’t do much for them, beyond helping some of them understand the standards better.
  • Some students memorize words that they don’t really understand. The interactive grading discussion helps (some of) them understand a bit better the differences between memorized words and understanding something well enough to explain it in their own words and to explain how to do it or use it or why it’s important. It gives them a path to a different type of answer when they ask themselves while studying, “Do I know this well enough?”
  • Some students lack basic student-skills (how to study, how to look things up online, how to use the library, etc.). I can demonstrate these activities during the discussion, having the student do them with me. Students don’t become experts overnight with these, but as I’ll discuss below, I think this sometimes leads to noticeable improvements.

Some of the tasks that I assign to students can be done in a professional way. I choose tasks that I can demonstrate at a professional level of skill. In the give-and-take of interactive grading, I can say, “Let me show you how a professional would do that.” There is a risk of hijacking the discussion, turning it into yet-another-lecture. But judiciously used, this is a very personalized type of coaching of complex skills.

Now consider performance problems. The student understands the material but provides an exam answer or submits an assigned paper that doesn’t adequately reflect what they know, at their level of sophistication. A student with an “A” level of knowledge might look like a “C” student. A student with a “C” level of knowledge might look like a “D” or “F” student. These students are often puzzled by their poor grades. The interactive grading format makes it possible for me to show a student example after example after example of things they are doing (incomprehensible sentences, uninterpretable diagrams, confusing structure, confusing formatting, etc.) and how these make it hard on the reader. Here are two examples:

  • Some of my students speak English as a second language. Some speak/write it well; some are trying hard to communicate well but make grammar/spelling errors that don’t interfere with my ability to understand their writing. Some write sentences that I cannot understand. They expect me to guess their meaning and they expect me to bias my guessing in their favor. During interactive grading, I can read a sentence, realize that I can’t understand it, and then ask the student what it means. In some cases, the student had no idea (they were, as far as I can tell, bluffing, hoping I would give them points for nothing). In other cases, the student intended a meaning but they realized during the discussion that they were not conveying that meaning. For some students, this is a surprise. They didn’t realize that they were failing to communicate. Some have said to me that they had previously thought they were being downgraded for minor errors in their writing (spelling, grammar) rather than for writing something that the instructor could not understand. For some students, I think this changes their motivation to improve their writing.
  • Some students write disorganized answers, or answers that are organized in a fundamentally different way from the structure explicitly requested by the question. Some students do this strategically (with the goal of disguising their ignorance). Others are simply communicating poorly. In my experience, retraining these students is very hard, but this gives me another opportunity to highlight the problems in the work, to demonstrate how the problems affect how I analyze and evaluate their work, and how they could do it differently. (For more on this, see the discussion in the next section, Benefits for Me).

Benefits for Me

It’s easy to grade “A” work. The student “gets it” and so they get a high grade. When I recognize quickly that the student has met my objectives for a specific exam question, I stop analyzing it, award a very high grade, and move to the next question. Very fast.

In contrast, a typical “C” or “D” answer takes a lot longer to grade. The answer is typically disorganized, confused, hard to understand, rewords the question in an effort to present the question as its answer, seems to make inappropriate assumptions about what I should find obvious, has some mistakes, and/or has contradictions or inconsistencies that make the answer incoherent even though each individual component could be argued to be not-necessarily-wrong.

When I say, “disorganized”, I mean (for example) that if the question asks for parts (1), (2), (3), (4) and (5), the student will give three sections instead that address (in the first one) (1), (3) and (5), (in the second one) (1) and (2) and (in the third one) (3) and (5) with a couple of words from the question about part (4) but no added information.

I waste a lot of time trying to understand these answers. When I grade privately, I read a bad answer over several times, muttering as I try to parse the sentences and map the content to the question that was asked. It is uncertain work: I constantly question whether I am actually understanding what the student meant, and I struggle to figure out how much benefit of how much doubt I should give the student.

When I do interactive grading with a student, I don’t have to guess. I can say, “The question asked for Part 1. I don’t see an answer directly to Part 1. Can you explain how your answer maps to this part of the question?” I insist that the student map their actual words on the exam to the question that was asked. If they can’t sort it out for me, they can flunk. Next time, based on this experience, they can write in a way that will make it easier to map the answer to the question (make it easier for me to grade; get them a better grade).

This discussion is difficult. It can be unpleasant for the student and for me, and I have to be diplomatic and somewhat encouraging or it will become a mess. But I don’t have to struggle alone with not understanding what was written. I can put the burden back on the student.

Some other benefits:

  • Students who write essays by copying content without understanding it demonstrate their cluelessness when we grade the essay interactively.
  • Students who cheat on a take-home exam (in my experience so far) avoid the interactive grading session, where they would probably demonstrate that they don’t understand their own answers.
  • Students who want to haggle with me about their grade have an opportunity to do so without driving me crazy. I now have a way to say, “Show me what makes this a ‘B’,” instead of arguing with them about individual points or about their situational need for a higher grade.

Maybe the most important benefits:

  • (Most) students tell me they like it, and many of them ask to do it again (to transform a subsequent task into an interactively-graded one or to add interactive grading in another course).
  • Their performance seems to improve, sometimes significantly, which makes their next work easier to grade.
  • This is highly personalized instruction. The student is getting one-on-one attention from the professor. Many students feel as though they don’t get enough personal attention, and this addresses that feeling. It also makes some students a little more comfortable with dropping by my office for advice at other times.

A More Detailed Report

Someone who is just trying to learn what interactive grading is should stop here. What follows is “nuts and bolts”.

I’ve done interactive grading for three types of work:

  • midterm exams
  • practical assignments (homework that requires the student to apply something they have learned to a real-life task)
  • research essays

The overall process for interactive grading is the same for all three. But I’ve also noticed some differences. Marking exams isn’t the same as marking essays; neither is grading them interactively.

If you’re going to try interactive grading for yourself, these descriptions of how it worked for each type of work might help you make a faster and surer start.

Structural/Logistical Matters

I tell students what “interactive grading” is at the start of the course and I tell them which pieces of their work will be graded interactively. In most courses (in all of my university courses for credit), I tell them that this is a mandatory activity.

Some students choose not to participate in interactive grading sessions. Until now, I have tolerated this and graded their work in the traditional way. However, no one has ever given me a good reason for avoiding these (and I have run into several bad ones). The larger problem for me is that later in the course, when I want to do something that assumes that every student has had the interactive grading experience, it is inappropriate for students who skipped it. In the future, in academic courses, I will reinforce the “mandatory” nature of the activity by assigning a grade of zero on the work to a student who (after being warned of this) chooses not to schedule a grading meeting.

Scheduling the meetings is a challenge. To simplify it, I suggest a Doodle poll (www.doodle.com). Post the times you can be available and let each student sign up for a time (from your list) that s/he is available.

I schedule the meetings for 1.5 hours. They often end at 1 hour. Some drag on. If the meeting’s going to run beyond 2 hours, I usually force the meeting to a conclusion. If I think there will be enough value for the student, I offer to schedule a follow-up meeting to go over the rest of the work. If not, then either I announce the grade at the end of the session or I tell the student that I will grade the rest of the work in the traditional way and get back to them with the total.

During the meeting, the student sits across a desk from me. I use a computer with 3 monitors. Two of the monitors show the same information. I look at one and turn the other toward the student. Thus, the student and I can see the same things without having to crowd together to look over each other’s shoulder at one screen. The third screen faces me. I use it to look at any information that I don’t want to share with the student. I use 27″ monitors (small enough to fit on my desk, big enough to be readable for most students) with 1920×1080 resolution. This easily fits two readable word-processing windows side-by-side, such as the student’s submitted work in one window and the grading guide in the other.

Example 1: Midterm Exams

In some of my courses, I give the students a list of questions well before the exam and draw the exam questions from the list. In my Software Testing 1 course, for example (the “Black Box Testing Course”), I give students a 100-question subset of this list: http://www.testingeducation.org/BBST/takingexams/ExamEssayQuestions2010.pdf

I outlined the costs and benefits of this approach at WTST 2003, in: https://13j276.p3cdn1.secureserver.net/pdfs/AssessmentTestingCourse.pdf. For our purposes, the most important benefit is that students have time before the exam to prepare an answer to each question. They can’t consult their prepared answers during the actual exam, but this lets them come to the exam well-prepared, with a clear idea of what the question means, how to organize the answer, and what points they want to make in it.

I typically give 2 or 3 midterms in a course and I am typically willing to drop the worst one. Thus the student can do poorly on the first midterm, use that experience to learn how to improve their study strategy and writing, and then do better on the next one(s). I do the interactive grading with the first midterm.

Students sometimes submit exams on paper (traditional, handwritten supervised exam), sometimes in a word processor file (supervised exam where students type at a university-owned computer in a university class/exam-room), and sometimes in a word processor file (unsupervised take-home exam). For our purposes, assume that the student submitted an electronic exam.

Before I start grading any student’s work, I prepare a grading guide that identifies the types of information that I expect to see in the answer and the points that are possible for that particular type of info. If you’ve never seen that type of grading structure, look at my slides and videos (“How we grade exams”) at http://www.testingeducation.org/BBST/takingexams/.

Many of my questions are (intentionally) subject to some interpretation. Different students can answer them differently, even reaching contradictory conclusions or covering different technical information, but earn full points. The grading guide will allow for this, offering points for several different clusters of information instead of showing only One True Answer.
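Here is a minimal sketch, in Python, of what such a guide can look like. The question, cluster themes, and point values are invented for illustration:

    # A grading guide for one exam question (contents invented).
    # Each cluster is an alternative path to credit; a student earns
    # points for whichever clusters their answer actually covers,
    # capped at the question's maximum.
    grading_guide = {
        "question": "Describe the strengths and weaknesses of domain testing.",
        "max_points": 10,
        "clusters": [
            {"theme": "focus on variables and boundary values", "points": 4},
            {"theme": "efficiency: few tests from large value sets", "points": 3},
            {"theme": "blind spots: feature interactions, timing", "points": 3},
            {"theme": "alternative: contrast with scenario testing", "points": 3},
        ],
    }

    def score(covered_themes: set[str]) -> int:
        earned = sum(c["points"] for c in grading_guide["clusters"]
                     if c["theme"] in covered_themes)
        return min(earned, grading_guide["max_points"])

    # A student covering three of the four clusters still reaches 10/10.
    print(score({"focus on variables and boundary values",
                 "efficiency: few tests from large value sets",
                 "alternative: contrast with scenario testing"}))

The cap matters: several partial paths can add up to full credit, which is what lets two contradictory but well-argued answers both earn full points.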

During the meeting, I rely frequently on the following documents, which I drag on and off of the shared display:

  • The student’s exam
  • The grading guide for that exam
  • The set of all of the course slides
  • The transcripts of the (videotaped) lectures (some or all of my lectures are available to the students on video, rather than given live)
  • The assigned papers, if any, that are directly relevant to exam questions.

We work through the exam one question at a time. The initial display is a copy of the exam question and a copy of the student’s answer. I skim it, often running my mouse pointer over the document to show where I am reading. If I stop to focus, I might select a block of text with the mouse, to show what I am working on now.

  • If the answer is well done, I’ll just say “good” and announce the grade (“That’s a 10 out of 10”). Then I skip to the next answer.
  • If the answer is confusing or sometimes if it is incomplete, I might start the discussion by saying to the student, “Tell me about this.” Without having seen my grading guide, the student tells me about the answer. I ask follow-up questions. For example, sometimes the student makes relevant points that aren’t in the answer itself. I tell the student it’s a good point and ask where that idea is in the written answer. I listed many of my other questions near the start of this report.
  • At some point, often early in the discussion, I display my grading guide beside the student’s answer, explain what points I was looking for, and either identify things that are missing or wrong in the student’s answer or ask the student to map their answer to the guide.
    • Typically, this makes it clear that the point is not in the answer, and what the grading cost is for that.
    • Some students try to haggle with the grading.
      • Some simply beg for more points. I ask them to justify the higher grade by showing how their answer provides the information that the grading guide says is required or creditable for this question.
      • Some tell me that I should infer from what they did write that they must have known the other points that they didn’t write about and so I should give them credit. I explain that I can only grade what they say, not what they don’t say but I think they know anyway.
      • Some agree that their answer as written is incomplete or unclear but they show in this meeting’s discussion that they understand the material much better than their answer suggests. I often give them additional credit, but we’ll talk about why it is that they missed writing down that part of the answer, and how they would structure an answer to a question like this in the future so that their performance on the next exam is better.
      • Some students argue with the analysis I present in the guide. Usually this doesn’t work, but sometimes I give strong points for an analysis that is different from mine but justifiable. I might add their analysis to the grading guide as another path to high points for that answer.
  • The student might claim that I am expecting the student to know a specific detail (e.g. definition or fact) that wasn’t taught in the course. This is when I search the slides and lecture transcripts and readings, highlighting the various places that the required information appeared. I try to lead from here to a different discussion–How did you miss this? What is the hole in your study strategy?
  • Sometimes at the end of the discussion, I ask the student to go to the whiteboard and present a good answer to the question. I do this when I think it will help the student tie together the ideas we’ve discussed, and from doing that, see how to structure answers better in the future. A good presentation (many of them are good enough) might take the assigned grade for the question from a low grade to a higher one. I might say to the student, “That’s a good analysis. Your answer on paper was worth 3/10. I’m going to record a 7/10 to reflect how much better a job you can actually do, but next time we won’t have interactive grading so you’ll have to show me this quality in what you write, not in the meeting. If you give an answer this good on the next exam, you’ll get a 9 or 10.”

In a 1.5 hour meeting, we can only spend a few minutes on each question. A long discussion for a single question runs 20 minutes. One of my tasks is to move the discussion along.

Students often show the same weakness in question after question. Rather than working through the same thing in detail each time, I’ll simply note it. As a common example, I might say to a student:

Here’s another case where you answered only 2 of the 3 parts of the question. I think you need to make a habit of writing an outline of your answer, checking that the outline covers every part of the question, and then filling in the outline.

Other students were simply unprepared for the exam, and most of their answers are light on knowledge. Once it’s clear that lack of preparation was the problem (often it becomes clear because the student tells me so), I speed up the meeting, looking for things to ask or say that might add value (for example, complimenting a good structure, even though it is short on details). There is no value in dragging out the meeting. The student knows the work was bad. The grade will be bad. Any time that doesn’t add value will feel like scolding or punishment, rather than instruction.

The goal of the meeting is constructive. I am trying to teach the student how to write the next exam better. We might talk about how the student studied, how the student used peer review of draft answers, how the student outlined or wrote the answer, how the student resolved ambiguities in the question or in the course material — and how the student might do this differently next time.

Especially if the student achieved a weak grade, I remind the student that this is the first of three midterms and that the course grade is based on the best two. So far, the student has lost nothing. If they can do the next two well, they can get a stellar grade. For many students, this is a pleasant and reassuring way to conclude the meeting.

The statistical results of this are unimpressive. For example, 11 students completed a recent course.

  • On midterm 1 (interactively graded), their average grade was 76.7.
  • On midterm 2, the average grade was 81.5.
  • On midterm 3, the average grade was 76.3.
  • On the final exam, the average grade was 77.8.

Remember that I gave students added credit for their oral presentation during interactive grading, which probably added about 10 points to the average grade for midterm 1. Also, I think I grade a little more strictly toward the end of the course and so a B (8/10) answer for midterm 2 might be a B- (7.5) for midterm 3 or the final. Therefore, even though these numbers are flat, my subjective impression of the underlying performance was that it was improving, but this was not a powerful trend.

At the end of the meetings, I asked students whether they felt this was a good use of their time and whether they felt it had helped them. They all told me that it did. In another class (metrics), some students who had gone through interactive grading in the testing course asked for interactive grading of the first metrics midterm. This seemed to be another indicator that they thought the meetings had been helpful and sufficiently pleasant experiences.

This is not a silver bullet, but the students and I feel that it was helpful.

Example 2: Practical Assignments

In my software testing class, I assign tasks that people would do in actual practice. Here are two examples that we have taken to interactive grading:

  1. The student joins the OpenOffice (OOo) project (https://blogs.apache.org/OOo/entry/you_can_help_us_improve) and reviews unconfirmed bug reports. An unconfirmed bug has been submitted but not yet replicated. The student tries to replicate it and adds notes to the report, perhaps providing a simpler set of steps to get to the failure or information that the failure shows up only on a specific configuration or with specific data. The student also writes a separate report to our class, evaluating the communication quality and the technical quality of the original report. An example of this assignment is here: http://www.testingeducation.org/BBST/bugadvocacy/AssignmentBugEvaluationv11.3.pdf
  2. The student picks a single variable in OpenOffice Writer (such as the number of rows in a table) and does a domain analysis of it. In the process, the student imagines several (about 20) ways the program could fail as a consequence of an attempt to assign a value to the variable or an attempt to use the variable once that value has been assigned. For each risk (way the program could fail), the student divides the values of the variable into equivalence classes (all the values within the same class should cause the test with that variable and that value to behave the same way) and then decides which one test should be used from each class (typically a boundary value). An example of this assignment is here: http://www.testingeducation.org/BBST/testdesign/AssignmentRiskDomainTestingFall2011.pdf

The Bug Assignment

When I meet with the student about the bug assignment, we review the bug report (and the student’s additions to it and evaluation of it). I start by asking the student to tell me about the bug report. In the discussion, I have several follow-up questions, such as

  • how they tried to replicate the report
  • why they stopped testing when they did
  • why they tested on the configurations they did
  • what happened when they looked for similar (often equivalent) bugs in the OOo bug database and whether they learned anything from those other reports.

In many cases, especially if the student’s description is a little confusing, I will bring up OpenOffice and try to replicate the bug myself. Everything I do is on the shared screen. They see what I do while I give a running commentary.

  • Sometimes I ask them to walk me through the bug. They tell me what to do. I type what they say. Eventually, they realize that the instructions they wrote into the bug report aren’t as clear or as accurate as they thought.
  • Sometimes I try variations on the steps they tried, especially if they failed to replicate the bug.

My commentary might include anecdotes about things that have happened (that I saw or did) at real companies or comments/demonstration of two ways to do essentially the same thing, with the second being simpler or more effective.

When I ask students about similar bugs in the database, most don’t know how to do a good search, so I show them. Then we look at the reports we found and decide whether any are actually relevant.

I also ask the student how the program should work and why they think so. What did they do to learn more about how the program should work? I might search for specifications or try other programs and see what they do.

I will also comment on the clarity and tone of the comments the student added to the OOo bug report, asking the student why they said something a certain way or why they included some details and left out others.

Overall, I am providing different types of feedback:

  • Did the student actually do the work that was assigned? Often they don’t do the whole thing. Sometimes they miss critical tasks. Sometimes they misunderstand the task.
  • Did the student do the work well? Bug reporting (and therefore this assignment) involves a mixture of persuasive technical writing and technical troubleshooting.
    • How well did they communicate? How much did they improve the communication of the original report? How well did they evaluate the communication quality of the original report?
    • How well did they troubleshoot? What types of information did they look for? What other types could they have looked for that would probably have been helpful? What parameters did they manipulate and how wise were their choices of values for those parameters?

Thinking again about the distinction between competence and performance,

  • Some students show performance problems (for example, they do the task poorly because they habitually follow instructions sloppily). For these students, the main feedback is about their performance problems.
  • Some students show competence problems–they are good at following instructions and they can write English sentences, but they need to learn how to do a better job of bug reporting. For these students, the main feedback is what they are already doing well and what professional-quality work of this type looks like.

This is a 4-phase assignment. I try to schedule the interactive grading sessions soon after Phase 1, because Phase 3 is a more sophisticated repetition of Phase 1 (similar task on a different bug report). The ideal case is Phase 1 — Feedback — Phase 3. Students who were able to schedule the sessions this way have told me that the feedback helped them do a much better job on Phase 3.

The Domain Testing Analysis

Every test technique is like a lens that you look through to see the program. It brings some aspects of the program into clear focus, and you test those in a certain way. It pretty much makes the other aspects of the program invisible.

In domain testing, you see a world of variables. Each variable stands out as a distinct individual. Each variable has sets of possible values:

  • The set of values that users might try to assign to the variable (the values you might enter into a dialog box, for example). Some of these values are invalid — the variable is not supposed to take on these values and the program should reject them.
  • The set of values that the variable might actually take on — other parts of the program will use this variable and so it is interesting to see whether this variable can take on any (“valid”) values that these parts of the program can’t actually handle.
  • The set of values that might be output, when this variable is displayed, printed, saved to disk, etc.

These sets overlap, but it is useful to recognize that they often don’t map perfectly onto each other. A test of a specific “invalid” value might be sometimes useful, sometimes uninteresting and sometimes impossible.

For most variables, each of these sets is large, and so you would not want to test every value in the set. Instead, domain testing has you group values as equivalent (they will lead to the same test result) and then sample only one or two values from every set of equivalents. This is also called equivalence-class analysis and boundary testing.
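Here is a minimal sketch of that analysis in Python. The variable (number of rows in a table, echoing the assignment above), its valid range, and the single risk are all invented for illustration; the point is selecting one boundary value per equivalence class.

    # Domain analysis of one variable: the number of rows in a table.
    # Assume (hypothetically) the dialog accepts integers from 1 to 500.
    MIN_ROWS, MAX_ROWS = 1, 500

    # Risk: the program mishandles values at or just past the limits.
    # Partition the input values into equivalence classes; within a
    # class, any value should make the test behave the same way, so
    # sample only one or two values per class -- typically boundaries.
    equivalence_classes = {
        "too small (should be rejected)": [MIN_ROWS - 1],        # 0
        "valid (should be accepted)":     [MIN_ROWS, MAX_ROWS],  # 1, 500
        "too large (should be rejected)": [MAX_ROWS + 1],        # 501
    }

    for klass, boundary_values in equivalence_classes.items():
        for value in boundary_values:
            expected = "accept" if MIN_ROWS <= value <= MAX_ROWS else "reject"
            print(f"test rows={value:>3}  class: {klass:<30}  expect: {expected}")

A fuller analysis would repeat this for each risk you imagined (non-integer input, values other parts of the program cannot handle, output and display limits), because each risk can partition the values differently.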

One of the most challenging aspects of this task is adopting the narrow focus of the technique. People aren’t used to doing this. Even working professionals — even very skilled and experienced working professionals — may not be used to doing this and might find it very hard to do. (Some did find it very hard, in the BBST:Test Design course that I taught through the Association for Software Testing and in some private corporate classes.)

Imagine an assignment that asks you to generate 15 good tests that are all derived from the same technique. Suppose you imagine a very powerful test (or a test that is interesting for some other reason) that tests that variable of that program (the variable you are focusing on). Is it a good test? Yes. Is it a domain test? Maybe not. If not, then the assignment is telling you to ignore some perfectly good tests that involve the designated variable, maybe in order to generate other tests that look more boring or less powerful. Some people find this confusing. But this is why there are so many test techniques (BBST: Test Design catalogs over 100). Each technique is better for some things and worse for others. If you are trying to learn a specific technique (you can practice a different one tomorrow), then tests that are not generated by that technique are irrelevant to your learning, no matter how good they are in the general scheme of things.

We run into a conflict of intuitions in the practitioner community here. The view that I hold, that you see in BBST, is that the path to high-skill testing is through learning many different techniques at a high level of skill. To generate a diverse collection of good tests, use a diverse set of techniques. To generate a few tests that are optimized for a specific goal, use a technique that is appropriate for that goal. Different goals, different techniques. But to do this, you have to apply a mental discipline while learning. You have to narrow your focus and ask, “If I was a diehard domain tester who had no use for any other technique, how would I analyze this situation and generate my next tests?” Some people have told me this feels narrow-minded, that it is more important to train testers to create good tests and to get in touch with their inner creativity, than to cramp their style with a narrow vision of testing.

As I see it, this type of work doesn’t make you narrow-minded. It doesn’t stop you from using other techniques when you actually do testing at work. This is not your testing at work. It is your practice, to get good enough to be really good at work. Think of practicing baseball. When you are in batting practice, trying to improve your hitting, you don’t do it by playing catch (throwing and catching the ball), not even if it is a very challenging round of catch. That might be good practice, but it is not good batting practice.

The domain testing technique helps you select a few optimal test values from a much larger set of possibilities. That’s what it’s good for. We use it to pick the most useful specific values to test for a given variable. Here’s a heuristic: if you design a test that would probably yield the same result (pass/fail) no matter what value is in the variable, you are probably focused on testing a feature rather than a variable, and you are almost certainly not designing with domain testing.

Keeping in mind the distinction between competence and performance,

  • Some students show performance problems (for example, they do the task poorly because they habitually follow instructions sloppily). For these students, the main feedback is about their performance problems.
  • The students who show competence problems are generally having trouble with the idea of looking at the world through the lens of a single technique. I don’t have a formula for dealing with these students. It takes individualized questioning and example-creating, and that doesn’t always work. Here are some of the types of questions:
    • Does this test depend on the value of the variable? Does the specific value matter? If not, then why do we care which values of the variable we test with? Why do a domain analysis for this?
    • Does this test mainly depend on the value of this variable or the value of some other variable?
    • What parts of the program would care if this variable had this value instead of that value?
    • What makes this specific value of the variable better for testing than the others?

From my perspective as the teacher, these are the hardest interactive grading discussions.

  • The instructions are very detailed, but they are a magnet for performance problems that dominate the discussions with weaker students.
  • The technique is not terribly hard, if you are willing to let yourself apply it in a straightforward way. The problem is that people are not used to explicitly using a cognitive lens, which is essentially what a test technique is.

Student feedback on this has been polite but mixed. Many students are enthusiastic about interactive grading and (say they) feel that they finally understood what I was talking about after the discussion that applied it to their assignment. Other students came away feeling that I had an agenda, or that I was inflexible.

Example 3: Research Essays

In my Software Metrics course, I require students to write two essays. In each essay, the student is required to select one (1) software metric and to review it. I give students an outline for the essay that has 28 sections and subsections, requiring them to analyze the validity and utility of the metric from several angles. They are to look in the research literature to find information for each section, and if they cannot find it, to report how they searched for that information (what search terms in which electronic databases) and summarize the results. Then they are to extrapolate from the other information they have learned to speculate what the right information probably is.

These are 4th year undergraduates or graduate students. A large percentage of these students lack basic library skills. They do not know how to do focused searches in electronic databases for scholarly information and they do not know how to assess its credibility or deal with conflicting results and conflicting conclusions. Many of them lack basic skills in citing references. Few are skilled at structuring a paper longer than 2 or 3 pages. I am describing good students at a well-respected American university that is plenty hard to get into. We have forced them to take compulsory courses in writing, but those courses only went so far and the students only paid so much attention. My understanding, having talked at length with faculty at other schools, is that this is typical of American computer science students.

My primary goal is to help students learn how to cut through the crap written about most metrics (wild claims pro and con, combined with an almost-shocking lack of basic information) so that they can do an analysis as needed on the job.

Metrics are important. They are necessary for management of projects and groups. Working with metrics will be demanded of most people who want to rise beyond junior-level manager or mid-level programmer, and of plenty of people whose careers will dead-end below that level. But badly-used metrics can do more harm than good. Given that most (or all) software metrics are poorly researched, have serious problems of validity (or at least have little or no supporting evidence of validity), and carry serious risk of causing side-effects (measurement dysfunction) in the organization that uses them, there is no simple answer to what basket of metrics should be used for a given context. On the other hand, we can look at metrics as imperfect tools. People can be pretty good at finding information if they understand what they are looking for and they understand the strengths and weaknesses of their tools. And they can be pretty good at limiting the risks of risky things. So rather than encouraging my students to adopt a simplistic, self-destructive (or worse, pseudo-moralistic) attitude of rejection of metrics, I push them to learn their tools, their limits, their risks, and some ways to mitigate risk.

My secondary goal is to deal with the serious performance problems that these students have with writing essays.

I assign two essays so that they can do one, get detailed feedback via interactive grading, and then do another. I have done this in one (1) course.

Before the first essay was due, we had a special class with a reference librarian who gave a presentation on finding software-metrics research literature in the online databases of Florida Tech’s library and we had several class discussions on the essay requirements.

The results were statistically unimpressive: the average grade went from 74.0 to 74.4. Underneath the numbers, though, is the reality that I enforced a much higher standard on the second essay. Most students did better work on Essay 2, often substantially.

The process for interactive grading was a little different. I kept the papers for a week before starting interactive grading, so that I could check them for plagiarism. I use a variety of techniques for this (if you are curious, see the video course on plagiarism-detection at http://www.testingeducation.org/BBST/engethics/). I do not do interactive grading sessions with plagiarists. The meeting with them is a disciplinary meeting and the grade is zero.

Some students skirted the plagiarism line, some intentionally and others not. In the interactive grading session, I made a point of raising this issue, giving these students feedback on what I saw, how it could be interpreted and how/why to avoid it in the future.

For me, doing the plagiarism check first is a necessary prerequisite to grading an essay. That way, I have put behind me the question of whether this is actually the student's work. From here, I can focus on what the student did well and how to improve it.

A plagiarism check is not content-focused. I don’t actually read the essay (or at least, I don’t note or evaluate its content, I ignore its structure, I don’t care about its style apart from looking for markers of copying, and I don’t pay attention to who wrote it). I just hunt for a specific class of possible problems. Thus, when I meet with the student for interactive grading, I am still reading the paper for the first time.
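
My actual techniques are the ones covered in the video course linked above, and I won't repeat them here. Purely to give a flavor of what one simple, automated marker-of-copying check can look like, here is a minimal shingle-overlap sketch in Python. This is my illustration, not one of the course's techniques; the names, the shingle length, and the threshold are all invented for this post:

# Illustrative only: flag pairs of essays that share an unusually large
# fraction of 8-word shingles (contiguous word sequences). The names,
# shingle length, and threshold are invented for this sketch.
import re
from itertools import combinations

def shingles(text, n=8):
    """Return the set of n-word sequences appearing in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b):
    """Fraction of the shorter essay's shingles that also appear in the other."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))

# essays maps a student identifier to the full text of their essay.
essays = {
    "student_a": "text of the first essay ...",
    "student_b": "text of the second essay ...",
}

for (name1, text1), (name2, text2) in combinations(essays.items(), 2):
    score = overlap(text1, text2)
    if score > 0.15:  # arbitrary threshold; real cutoffs need tuning
        print(f"worth a close look: {name1} vs {name2} ({score:.0%} overlap)")

A check like this only produces candidates for human review, not verdicts: a high overlap might be legitimate quotation, and a careful plagiarist can defeat it, which is one reason a variety of techniques is needed.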

During the interactive grading session, I ask the student to tell me about the metric.

One of the requirements that I impose on the students is that they apply the metric to real code that they know, so that they can see how it works (most don't do it). As computer science students, they have code, so if they understand the metric, they can do this. I ask them what they did and what they found. In the rare case that the student has actually done this, I ask whether they did any experimenting, changing the code a little to see the effect on the metric. With some students, this leads to a good discussion about investigating your tools. With the students who didn't do it, I remind them that I will be unforgiving about this in Essay 2. (Astonishingly, most students still didn't do this in Essay 2. It cost each of them a one-letter-grade (10-point) penalty.)
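
The assignment doesn't prescribe a particular metric or tool, so the details vary with each student's choice. Purely as an illustration of the kind of experiment I mean, here is a minimal sketch, assuming a student who picked cyclomatic complexity and Python. The simplified counter and the sample functions are invented for this post, not part of any assignment:

# A rough cyclomatic-complexity counter using Python's ast module.
# Complexity here = 1 + number of branching constructs, a common
# simplification of McCabe's definition.
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source):
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES)
                   for node in ast.walk(tree))

before = """
def http_label(code):
    if code < 300:
        return "success"
    elif code < 400:
        return "redirect"
    elif code < 500:
        return "client error"
    else:
        return "server error"
"""

# The same behavior, with the if/elif chain flattened into a data table.
after = """
def http_label(code):
    for limit, label in [(300, "success"), (400, "redirect"),
                         (500, "client error")]:
        if code < limit:
            return label
    return "server error"
"""

print(cyclomatic_complexity(before))  # 4: three ifs plus the base path
print(cyclomatic_complexity(after))   # 3: one for plus one if plus the base path

The interesting conversations start when the student notices what the number does and doesn't respond to: here, moving the same logic into a data table drops the count by one even though the behavior is identical.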

From there, I typically read the paper section by section, highlighting the section that I am reading so that the student can follow my progress on the display. Within about half a page of reading, I make comments (like, "this is really well-researched") or ask questions (like, "this source must have been hard to find; what led you to it?").

The students learned some library skills in the meeting with the librarian. The undergrads have also taken a 1-credit library skills course. But most of them have little experience. So at various points, it becomes clear that the student has not been able to find any (or much) relevant information. This is sometimes an opportunity for me to tell the student how I would search for this type of information, and then demonstrate that search. Sometimes it works; sometimes I find nothing useful either.

In many cases, the student finds only the cheerleading of a few people who created a metric or who enthuse about it in their consulting practices. In those reports, all is sunny and light. People might have some problems trying to use the metric, but overall (these reports say), someone who is skilled in its use can avoid those problems and rely on it. In the discussion, I ask skeptical questions. I demonstrate searches for more critical commentary on the metric. I point out that many of the claims by the cheerleaders lack data. They often present summaries of not-well-described experiments or case studies. They often present tables with numbers that came from not-exactly-clear-where. They often present reassuring experience reports.

Experience reports (like the one you are reading right now) are useful, but they don't have a lot of evidence-value. A well-written experience report should encourage you to try something yourself, give you tips on how to do that, and encourage you to compare your experiences with the reporter's. For experience reports to provide enough evidence of some claim to let you draw conclusions about it, I suggest that you require several reports from people not associated with each other that make essentially the same claim, or describe essentially the same thing or problem, or reach the same conclusion. I also suggest that you look for counter-examples.

During the meeting, I might make this same point and then hunt for more experience reports, probably in parallel with the student, who searches on their own computer while I run the search on mine. This type of attack on the data foundation of a metric is a bit familiar to these students because we use Bossavit's book, The Leprechauns of Software Engineering: How folklore turns into fact and what to do about it, as one of our texts.

In the essays, all students have performance problems (that is, problems with the generic task of writing essays and the generic task of finding information), and so (in my extensive experience: 1 course) the discussion with every student switches back and forth between the content (the analysis of the metric and of the research data) and the generic parts of the assignment.

A Few More Thoughts

You can use interactive grading with anything that you can review. For example, Keith Gallagher uses interactive grading in his programming courses, reading the student's code with them and asking questions about its architecture, style, and syntax, and about whether the program actually provides the benefits it is supposed to provide. His students have spoken enthusiastically to me about this.

Dr. Gallagher ends his sessions a little differently than I do. Rather than telling students what their grade is, he asks them what they think their grade should be. In the discussion that follows, he requires the student to view the work from his perspective, justifying their evaluation in terms of his grading structure for the work. It takes some skill to pull this off, but done well, it can give a student who is disappointed with their grade even more insight into what needs improvement.

Done with a little less skill, interactive grading can easily become an interaction that is uncomfortable or unpleasant for the student. To a student who did poorly, it can feel as if the instructor is bullying them. It can become an interaction in which the instructor does all the talking. If you want to achieve your objectives, you have to intentionally manage the tone of the meeting so that it serves them.

Overall, I’m pleased with my decision to introduce interactive grading to my classes. As far as I can tell, interactive grading takes about the same amount of time as more traditional grading approaches. There are several benefits to students. Those with writing or language problems have the opportunity to demonstrate more skill or knowledge than their written work might otherwise suggest. They get highly personalized, interactive coaching that seems to help them submit better work on future assignments. Most students like interactive grading and consider it a worthwhile use of their time. Finally, I think it is important to have students feel they are getting a good value for their tuition dollar. This level of constructive personal attention contributes to that feeling.

This post is partially based on work supported by NSF research grant CCLI-0717613, "Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing." Any opinions, findings and conclusions or recommendations expressed in this post are those of the author and do not necessarily reflect the views of the National Science Foundation.

CAST 2012 Metrics Talk Posted

Tuesday, August 7th, 2012

[Image: title slide]

I’ve posted the video and slide deck for the metrics talk Nawwar and I did at CAST. I hope you enjoy them.