A new brand of snake oil for software testing

I taught a course last term on Quantitative Investment Modeling in Software Engineering to a mix of undergrad and grad students of computer science, operations research, and business. We had a great time, and we learned a lot about the market, about modeling, and about automated exploratory testing (more on this type of exploratory testing at this year’s Conference of the Association for Software Testing…).

In the typical undergraduate science curriculum, most of the experimental design we teach is statistical. Given a clearly formulated hypothesis and a reasonably well-understood oracle, students learn how to design experiments that control for confounding variables, so that they can decide whether an experimental effect was statistically significant. We also teach some instrumentation, but in most cases, the students learn how to use well-understood instruments as opposed to how to appraise, design, develop, calibrate, and then apply them.

Our course was not so traditionally structured: each student had to propose and evaluate an investment strategy. We started with a lot of bad ideas. (Most small investors lose money. One of the attributes of our oracle is, “If it causes you to lose money, it’s probably a bad idea.”) We wanted to develop and demonstrate good ideas instead. We played with tools (some worked better than others) and wrote code to evolve our analytical capabilities, studied some qualitative research methods (hypothesis-formation is a highly qualitative task), ran pilot studies, and then eventually got to the formal-research stages at which typical lab courses start.

Not surprisingly, the basics of designing a research program took about a third of the course. With another course, I probably could have trained these students to be moderately skilled EVALUATORS of research articles. (It is common in several fields to see this as a full-semester course in a doctoral program.)

Sadly, few CS doctoral programs (and even fewer undergrad programs) offer courses in the development or evaluation of research, or if they offer them, they don’t require them.

The widespread gap between having a little experience replicating other people’s experiments and seeing some work in a lab, on the one hand, and learning to do and evaluate research, on the other, is the home court for truthiness. In the world of truthiness, it doesn’t matter whether the evidence in support of an absurd assertion is any good, as long as we can make it look to enough people as though good enough evidence exists. Respectable-looking research from apparently well-credentialed people is hard to dispute if, like most people in our field, one lacks training in the critical evaluation of research.

The new brand of snake oil is “evidence-based” X, such as “evidence-based” methods of instruction or, in a recent proposal, evidence-based software testing. Maybe I’m mistaken in my hunch about what this is about, but the tone of the abstract (and what I’ve perceived in my past personal interactions with the speaker) raises some concerns.

Jon Bach addresses the tone directly. You’ll have to form your own personal assessments of the speaker. But I agree with Jon that this does not sound merely like advocacy of applying empirical research methods to help us improve the practice of testing, an idea that I rather like. Instead, the wording suggests a power play that seems to me to have less to do with research and more to do with the next generation of ISTQB marketing.

So let me talk here about this new brand of snake oil (“Evidence-Based!”), whether it is meant this way by this speaker or not.

The “evidence-based” game is an interesting one to play when most of the people in a community have limited training in research methods or research evaluation. This game has been recently fashionable in American education. In that context, I think it has been of greatest benefit to people who make money selling mediocritization. It’s not clear to me that this movement has added one iota of value to the quality of education in the United States.

In principle, I see 5 problems (or benefits, depending on your point of view). I say, “in principle” because of course, I have no insight into the personal motives and private ideas of Dr. Reid or his colleagues. I am raising a theoretical objection. Whether it is directly applicable to Dr. Reid and ISTQB is something you will have to decide yourself, and these comments are not sufficient to lead you to a conclusion.

  1. It is easy to promote forced results from worthless research when your audience has limited (or no) training in research methods, instrumentation, or evaluation of published research. And if someone criticizes the details of your methods, you can dismiss their criticisms as quibbling or theoretical. Too many people in the audience will be stuck basing their decision about the merits of the objection on the personal persuasiveness of the speakers (something snake oil salesmen excel at) rather than on the underlying merits of the research.
  2. When one side has a lot of money (such as, perhaps, proceeds from a certification business), and a plan to use “research” results as a sales tool to make a lot more money, they can invest in “research” that yields promotable results. The work doesn’t have to be competent (see #1). It just has to support a conclusion that fits with the sales pitch.
  3. When the other side doesn’t have a lot of money, when the other side consists mainly of practitioners (not much time or training to do the research), and when competent research costs a great deal more than trash (see #2 and #5), the debates are likely to be one-sided. One side has “evidence,” and if the other side objects, well, if they think the “evidence” is so bad, they should raise a bunch of money and donate a bunch of time to prove it. It’s an opportunity for well-funded con artists to take control of the (apparent) high road. They can spew impressive-looking trash at a rate that cannot possibly be countered by their critics.
  4. It is easy for someone to do “research” as a basis for rebranding and reselling someone else’s ideas. Thus, someone who has never had an original thought in his life can be promoted as the “leading expert” on X by publishing a few superficial studies of it. A certain amount of this goes on already in our field, but largely as idiosyncratic misbehavior by individuals. There is a larger threat. If a training organization will make more money (influence more standards, get its products mandated by more suckers) if its products and services have the support of “the experts”, but many of “the experts” are inconveniently critical, there is great marketing value in a vehicle for papering over the old experts with new-improved experts who have done impressive-looking research that gives “evidence-based” backing to whatever the training organization is selling. Over time, of course, this kind of plagiarism kills innovation by bankrupting the innovators. For companies that see innovation as a threat, however, that’s a benefit, not a problem. (For readers who are wondering whether I am making a specific allegation about any person or organization, I am not. This is merely a hypothetical risk in an academic’s long list of hypothetical risks, for you to think about in your spare time.)
  5. In education, we face a classic qualitative-versus-quantitative tradeoff. We can easily measure how many questions someone gets right or wrong on simplistic tests. We can’t so easily measure how deep an understanding someone has of a set of related concepts or how well they can apply them. The deeper knowledge is usually what we want to achieve, but it takes much more time and much more money and much more research planning to measure it. So instead, we often substitute the simplistic metrics for the qualitative studies. Sadly, when we drive our programs by those simplistic metrics, we optimize to them and we gradually teach to the superficial and abandon the depth. Many of us in the teaching community in the United States believe that over the past few years, this has had a serious negative impact on the quality of the public educational system and that this poses a grave threat to our long-term national competitiveness.

Most computer science programs treat system-level software testing as unfit for the classroom.

I think that software testing can have great value, that it can be very important, and that a good curriculum should have an emphasis on skilled software testing. But the popular mix of ritual, tedium, and moralizing that has been passed off by some people as testing for decades has little to offer our field, and even less for university instruction. I think ISTQB has been masterful at selling that mix. It is easy to learn and easy to certify. I’m sure that a new emphasis, “New! Improved! Now with Evidence!” could market the mix even better. Just as worthless, but with even better packaging.

2 Responses to “A new brand of snake oil for software testing”

  1. Ken Mizell says:

    Hi Cem,
    I just stumbled across your article https://kaner.com/?p=84 while looking for information on building a test strategy. Although I didn’t attend the conference, and therefore didn’t hear the speaker’s address, I did go to the link you supplied and read the abstract for their talk.

    I have to say I wholeheartedly agree with your sentiments 1-5 above, but I also fail to see how the speaker’s call for evidence of the effectiveness of testing behavior contradicts the practicality you usually promote.

    Maybe I didn’t get what the rest of their talk ended up being about. I do think it’s a good idea to have a retrospective activity at the end of a testing/release cycle, to re-assess your team’s effectiveness and the effectiveness of their behaviors during the cycle, and to base that assessment on evidence. Now I do admit that this is not easy to do, and it is often a subjective measure, because as you well know, “bug count” or “bug find rate” is not a measure of your team’s effectiveness, not if they are finding useless bugs or bugs that don’t matter. But there are certain things that are measurable, and they involve reducing waste (defining waste loosely as activity that may be necessary but does not add value, like regression testing when all tests pass: type 1 muda). For instance, if it takes 3 testers 60 hours each to execute all of the necessary regression tests and they all pass, or it takes 1 tester 30 minutes plus 2 days of code maintenance to accomplish the same thing with automation, then you can objectively measure the time savings, and that ‘evidence’ could lead you to conclude that it was a good idea to automate the regression tests, freeing up 2 testers to do more exploratory testing (a rough sketch of that arithmetic appears at the end of this comment). And you might look for other “wasteful” or repetitive tests that are necessary but expected to be low-yield as additional candidates for automation.

    You could also run experiments in which two different teams tried differing testing strategies against the same product or feature at the same time (if you had the luxury of having 2 teams cover the same thing), and then look for evidence as to which was more effective at yielding high-quality bugs. That might lead you to change the prioritization of the tactics used in future testing cycles towards those deemed most effective for your product’s environment.

    I also agree with you about the degeneration of schools into the “easy but measurable” (multiple choice) rather than the subjective, deep learning that a passionate “master” or “Sensei” can teach a student. Maybe that was your point: a cautionary tale against relying solely on evidence to determine practice, and against abandoning ideas that may not be as efficient but may still have value.
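
    To make that arithmetic concrete, here is a rough sketch of the comparison in Python. The figures are just the illustrative ones from my example above, and I’m assuming the 2 days of maintenance are 8-hour working days:

        # Rough sketch of the regression-automation comparison described above.
        # All figures are illustrative assumptions from this comment; the 2 days
        # of script maintenance are assumed to be 8-hour working days.

        MANUAL_TESTERS = 3
        MANUAL_HOURS_EACH = 60        # hours per tester to run the regression suite by hand

        AUTOMATED_RUN_HOURS = 0.5     # one tester spends 30 minutes running the automated suite
        MAINTENANCE_HOURS = 2 * 8     # assumed: 2 days of script maintenance at 8 hours per day

        manual_cost = MANUAL_TESTERS * MANUAL_HOURS_EACH          # 180 person-hours
        automated_cost = AUTOMATED_RUN_HOURS + MAINTENANCE_HOURS  # 16.5 person-hours
        savings = manual_cost - automated_cost

        print(f"Manual regression pass:    {manual_cost} person-hours")
        print(f"Automated regression pass: {automated_cost} person-hours")
        print(f"Freed for exploratory testing: {savings} person-hours per cycle")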

  2. In response to Ken:

    I DID attend the conference, and I did see the talk in question, and for me, it set a new (low) standard. There was a rehash of what is considered “validation” in science (one that essentially touted academic research without any mention of standards for that research, threats to its validity, measurement and observation problems, and so on). There was some “evidence” provided that Testing Is A Good Thing, based on one set of figures from a noted measurer of software whose approach to measurement is riddled with problems. There was a minor appeal to the kind of quantitative research modeled on physics, where it’s clear to me that evaluation of testing must (as Cem has often pointed out) be rooted in social-science approaches.

    I don’t think Cem is opposed to the use of evidence. A far greater threat is the idea that bad evidence is better than nothing. To me, that’s like saying a bad investment in the stock market is better than no investment in the stock market.

    —Michael B.