This is the 4th section of my post on BBST 4.0. The other parts are at:
- 1. Background: What is BBST. (If you are already familiar with BBST, skip this)
- 2. Differences Between the Core BBST Courses and Domain Testing
- 3. Learning Objectives and Structure of Foundations 3.0 (2010)
- 5. Financial Model & Concluding Thoughts
What We Think Should Change
We are satisfied with the course standards and the supporting objectives (teaching students better learning skills). We want to make it easier for students to succeed by improving how we teach the course, but our underlying criteria for success aren’t going to change.
In contrast, we are not satisfied with the central content. Here are my notes:
1. Information objectives drive the testing mission and strategy
When you test a product, you have a defining objective. What are you trying to learn about the product? We’ll call this your information objective. Your mission is to achieve your information objective. So the information objective is more about what you want to know and the mission is more about what you will do. Your strategy describes your overall plan for actually doing/achieving it.
This is where we introduce context-driven testing. For example, sometimes testers will organize their work around bug-hunting but other times they will organize around getting an exciting product into the market as quickly as possible and making it improvable as efficiently as possible. The missions differ here, and so do the strategies for achieving them.
Critique:
- The distinction between information objective and mission is too fine-grained. It led to unhelpful discussions and questions. We’re going to merge the two concepts into “mission.”
- The presentation of context got buried in a mass of other details. We gave students an assignment, in which we described several contexts and they had to discuss how testing would differ in them. This was too hard for too many students. We have to provide a better foundation in the lecture and reading.
- We must present more contexts and/or characterize them more explicitly. Back when we created BBST 3.0, we treated some development approaches as ideas rather than as contexts. The world has moved on, and what was, to some degree, hypothetical or experimental, has become established. For example:
- In BBST 3.0, when we described a testing role that would fit in an agile organization, we avoided agile-development terminology (for reasons that no longer matter). In retrospect, that decision was badly outdated when we made it. Several different contexts grew out of the Agile Manifesto. Books like Crispin & Gregory illustrate the culture of testing in those situations. In BBST 4.0, we will treat this as a normal context
- Similarly, a very rapid development process is common in which you ship an early version, polish aggressively based on customer feedback, and make ongoing incremental improvements rapidly. Whether we think this approach is wonderful or not, we should recognize it as a common context. A context-respecting tester who works for, or consults to, a company that develops software this way is going to have to figure out how to provide valuable testing services that fit within this model.
- Context-driven testing isn’t for everyone. Context-driven testing requires the tester to adapt to the context—or leave. There are career implications here, but we didn’t talk much about them in BBST 3.0. The career issues were often visible in the classes, but there was no place in the course to work on them. This section (mission drives strategy) isn’t the place for them, but the course needs a place for career discussion.
Tentative decisions:
- Continue with this as the course opener.
- Merge information objectives and mission into one concept (mission) which drives strategy.
- Tighten up on the set of definitions. Definitions / distinctions that aren’t directly applicable to the course’s main concepts (such as the distinction between black box testing and behavioral testing) must go away.
- Create a separate, well-focused treatment of career paths in software testing. (This will become a supplementary video, probably presented in the lesson that addresses test automation.)
2. Oracles are heuristic
An oracle is a mechanism for determining whether a program passed or failed a test. In 1980, Elaine Weyuker pointed out that perfect oracles rarely exist and that in practice, we rely on partial oracles. Weyuker’s work wasn’t noticed by most of us in the testing community. The insight didn’t take hold until Doug Hoffman rediscovered it, presenting his Taxonomy of Test Oracles at Quality Week and then at an all-day session at the Los Altos Workshop on Software Testing (LAWST 5).
Foundations presented oracles from two conflicting perspectives, (a) Hoffman’s (and mine) and (b) James Bach & Michael Bolton’s.
- We started with an example from Bach’s Rapid Software Testing course, then presented the Bach and Bolton’s “heuristic oracles” (or “oracle heuristics”). A heuristic is a fallible but reasonable and useful decision rule. The heuristic aspect of a heuristic oracle is the idea that the behavior of software should usually (but not always) be consistent with a reasonable expectation. Bach developed a list of reasonable expectations, such as the expectation that the current version of the software will behave similarly to a previous version. This is usually correct, but sometimes it is wrong because of design improvement. Thus it is a heuristic. After the explanation, students worked through a challenging group assignment.
- Next, we presented Hoffman’s list of partial oracles and mentioned that these are useful supports for automated testing. We gave students some required readings, plus quiz questions and exam study guide questions but no assignment.
Critique:
This material worked well for RST. It did not work well in BBST. I published a critique: The Oracle Problem and the Teaching of Software Testing. and a more detailed analysis in the Foundations of Software Testing workbook. Here are some of my concerns:
- The terminology created at least as much confusion as insight. When we tested them, students repeatedly confused the meaning of the word “oracles” and the meaning of the word “heuristics.” Many students treated the words as equivalent and made statements like “all heuristics are oracles” (which is, of course, not true).
- The terminology is redundant and uninformative. Saying “heuristic oracle” is like saying “testly test.” The descriptor (heuristic, or testly) adds no new information. The word heuristic has been severely overused in software testing. A fallible decision rule is a heuristic. A cognitive tool that won’t always achieve the desired result is a heuristic. A choice to use a technological tool that won’t always achieve the desired result is a heuristic. Every decision testers make is rooted in heuristics because all of our decisions are made under uncertainty. Every tool we use can be called a heuristic because they are all imperfect and, given the impossibility of complete testing and the impossibility of perfect oracles, every tool we will ever use will be imperfect. This isn’t unique to software testing. Long before we started talking about this in software testing, Billy V. Koen’s award-winning introductions to engineering (often taught in general engineering courses) pointed this out that engineering reasoning and methods are rooted in heuristics. See Koen’s work here (my favorite presentation) and here and here. There is nothing more heuristic-like about oracles than there is about any other aspect of test design, or engineering in general.
- Heuristic is a magic word. Magic words provide a name for something while relieving you of the need to think further about it. The core issue with oracles is not that they are fallible. The core issue is that they are incomplete.
- The nature of incompleteness: An oracle will focus a test on one aspect of the software under test (or a few aspects, but just a few). Testers (and automated tests) won’t notice that the program behaves well relative to those aspects but misbehaves in other ways. Therefore, you can notice a failure if it is one that you’re looking for, but you never know whether the program actually passed a test. You merely learn that it didn’t fail the test.
- Human observers don’t eliminate this incompleteness. If you give a test to a human observer, with a well-defined oracle, they are likely to behave like a machine. They pay attention to what the oracle steers them to and don’t notice anything else. The phenomenon of not noticing anything else has been studied formally (inattentional blindness). If you don’t steer the observer with an oracle, you can get more diversity. If multiple people do the testing, different people will probably pay attention to different things, so across a group of people you will probably see greater coverage of the variety of risks. However, each individual is a finite-capacity processor. They have limited attention spans. They can only pay attention to a few things. When people split their attention, they become less effective in each task. An individual observer can introduce variation by paying attention to different potential problems in different runs of the same test.
- I don’t think you achieve effective oracle diversity by choosing to “explore” (another magic word, though one that I am more fond of). I think you achieve it by deliberately focusing on some things today that you know you did not focus on yesterday. We can do that systematically by intentionally adopting different oracles at different times. Thinking of it this way, oracle diversity is at least as much a matter of disciplined test design as it is a basis for exploration. (I learned this, and its importance for the design of suites of automated tests, from Doug Hoffman.)
- No aspect of the word “heuristic” takes us into these details. The word is a distraction from the question: How can we compensate for inevitable incompleteness, (a) when we work with automated test execution and evaluation and (b) when we work with human observers?
- There are two uses of oracles. We emphasized the one that is wrong for Foundations:
- The test-design oracle: An oracle is an essential part of the design of a test, or of a suite of tests. This is especially important in the design of automated tests. The oracle defines what problems the tests can see (and what problems they are blind to). Anyone who is learning how tests are designed should be learning about building suites of imperfect tests using partial oracles.
- The tester-expectations oracle: An oracle can be a useful component of a bug report. When you find a bug, you usually have to tell someone about it, and the telling often includes an explanation of why you think this behavior is wrong, and how important the wrongness is. For example, suppose you say, “this is wrong because it’s different from how it used to be.” You are relying on an expectation here and treating it as an oracle. The expectation is that the program’s behavior should stay the same unless the program is intentionally redesigned.
James Bach originally developed his list of tester expectations as a catalog of patterns of explanations that testers gave when they were asked to explain their bug reports. He recharacterized them as oracle heuristics (or heuristic oracles) years later. We think these are useful patterns, and when we create Bug Advocacy 4.0, we will probably include the ideas there (but maybe without the fancy oracular heuristical vocabulary).
- In Foundations, Bach and Bolton’s excellent presentation of tester-expectation oracles drowned out most students’ awareness of test-design oracles. That is, what we see on Foundations 3.0 exams is that many students forget, or completely ignore, the course’s treatment of partial oracles, remembering only the tester-expectation oracles.
Tentative decisions:
- All of the “heuristic” discussion of oracles is going away from Foundations.
- We will probably introduce the notion of heuristics somewhere in the BBST series. There is value in teaching people that testers’ reasoning and decisions are rooted in heuristics, and that this is normal in all applied sciences. However, we will teach it as its own topic. It creates a distraction when pasted onto oracles.
- We will tie the idea of partial oracles much more tightly to the design of suites of automated tests. We will focus on the problem that all oracles are incomplete, then transition into the questions:
- What are you looking for with this (test or) series of tests?
- What are these tests blind to?
- How will you look for those other potential problems?
- Can you detect these potential problems with code?
- If not, can you train people to look for them?
- Which problems are low-priority enough, or low-likelihood enough, that you should choose to run general exploratory tests and rely on luck and hope to help you stumble across them if they are there?
- Years ago, many test tool vendors (and many consultants) promised complete test automation from tools that were laughably inadequate for that purpose. These people promised “complete testing” that would replace all (or almost all) of your expensive manual testers. Several of us (including me) argued that these tools were weaker than promised. We pointed out that the tools provided a very incomplete look at the behavior of the software and even that often came with an excessively-high maintenance cost. These discussions were often adversarial and unpleasant. Sometimes very unpleasant. Arguments like these leave mental scars. I think these old scars are at the root of some of the test-automation skepticism that we hear from some of our leading old-timers (and their students). I think BBST can present a much more constructive view of current automation-support technology by starting from the perspective of partial oracles.
- Knott’s book on Mobile Application Testing provides an excellent overview of the variety of test-automation technologies, contrasting them in terms of the basis they use for evaluating the software (e.g. comparison to a captured screen) and the costs and benefits of that basis. These are all partial oracles. They come with their own costs, including tool cost, implementation difficulty, and maintenance cost and they come with their own limitations (what you can notice with this type of oracle and what you are blind to). I think this presentation of Knott’s, combined with updated material from Hoffman will probably become the core of the new oracle discussion.
- I think this is where we’ll treat automation issues more generally, discussing Cohn’s test automation pyramid (see Fowler too) (the value of extensive unit testing and below-the-UI integration testing) and Knott’s inverted pyramid. If I spend the time on this that I expect to spend, then along with describing the pyramid and the inverted pyramid and summarizing the rationales, I expect to raise the following issues:
- For traditional applications (situations in which we expect the pyramid to apply), I think the pyramid model substantially underestimates the need for end-to-end testing. For example:
- Performance testing is critical for many types of applications. To do this well, you need a lot of automated, end-to-end tests that model the behavior patterns of different categories of users.
- Some bugs are almost impossible to discover with unit-level or service-level automated tests or with manual end-to-end tests or with many types of automated end-to-end tests. In my experience, wild pointers, stack overflows, race conditions and other timing problems seem to be most effectively found by high-volume automated system-level tests.
- Security testing seems to require a combination of skilled manual attacks and routinized tests for standard vulnerabilities and high-volume automated probes.
- As we get better at these three types of automated system-level tests, I suspect that we will develop better technology for automated functional tests.
- For mobile apps, Knott makes a persuasive case that the pyramid has to be inverted (more system-level testing). This is because of an explosion of configuration diversity (significant variations of hardware and system software across phones) and because some tasks are location-dependent, time-dependent, or connection-dependent.
- I think there is some value in remembering that problems like this happened in the Windows and Unix worlds too. In DOS/Windows/Apple, we had to test application compatibility with printers printer-by-printer in each application. Same for compatibility with video cards, keyboards, rodents, network interfaces, sound cards, fonts, disk formats, disk drive interfaces, etc. I’ve heard of similar problems in Unix, driven more by variation in system hardware and architecture than by peripherals. Gradually, in the Windows/Apple worlds, the operating systems expanded their reach and presented standardized interfaces to the applications. Claims of successful standardization were often highly exaggerated at first, but gradually, variances between printer models (etc.) have become much simpler, smaller testing issues. We should look at the weakly-managed device variation in mobile phones as a big testing problem today, that we will have to develop testing strategies to deal with, while recognizing that those strategies will become obsolete as the operating systems mature. I think the historical perspective is important because of the risk that the mobile testers of today will become the test managers of tomorrow. There is a risk that they will manage for the problems they matured with, even though those problems have largely been solved by better technology and better O/S design. I think I see that today, with managers of my generation, some of whom seem still worried about the problems of the 1980’s/1990’s. Today’s students need historical perspective so that they can anticipate the evolution of system risks and plan to evolve with them.
- How does this apply to the inverted pyramid? I think that it is inverted today as a matter of necessity, but that 20 years from now, the same automation pyramid that applies to other applications will apply equally well to mobile. As the systems mature, the distortion of testing processes that we need to cope with immaturity will gradually fall away.
- For traditional applications (situations in which we expect the pyramid to apply), I think the pyramid model substantially underestimates the need for end-to-end testing. For example:
- Rather than arguing about whether automation is desirable (it is) and whether it is often extremely useful (it is) and whether it is overhyped (it is) and whether simplistic strategies that won’t work are still being peddled to people not sophisticated enough to realize what trouble they’re getting into (they are), I want people to come out of Foundations with a few questions. Here’s my current list. I’m sure it will change.:
- What types of information will we learn when we use this tool in this way?
- What types of information are we missing and how much do we care about which types?
- What other tools (or manual methods) can we use to fill in the knowledge gaps that we’ve identified?
- What are the costs of using this technology?
- Every added type of information comes at a cost. Are we willing to invest in a systematic approach to discovering a type of information about the software and if so, what will the cost be of that systematic approach? If the systematic approach is not feasible (too hard or too expensive), what manual methods for discovering the information are available, how effective are they, how expensive are they and how thorough a look are we willing to pay for?
Please send us good pointers to articles / blog posts that Foundations students can read to learn more about these questions (and ways to answer them). The course itself will go over this lightly. To address this in more depth would require a whole course, rather than a one-hour lecture. Students who are interested in this material (as every wise student will be) will learn most of what they learn on the topic from the readings.
A supplementary video:
The emphasis on automation raises a different cluster of issues having to do with the student’s career path. There used to be a solid career for a person who knew basic black box testing techniques and had general business knowledge. This person tested at the black box level, doing and designing exploratory and/or scripted tests. I think this will gradually die as a career. We have to note the resurgence of pieceworking, a type of employment where the worker is paid by the completed task (the “piece”) rather than by the time spent doing the task. If you can divide a task into many small pieces of comparable difficulty, you can spread them out across many workers. In general, I think pieceworking provides low incomes for people who are doing tasks that don’t require rare skills. Take a look at Soylent, “an add-in to Microsoft Word that uses crowd contributions to perform interactive document shortening, proofreading, and human-language macros.” Look over their site and read their paper. Think forward ten years. How much basic, system-level testing will be split out as piecework rather than done in-house or outsourced?
I’m not advocating for this change. I’m looking at a social context (that’s what context-driven people do) and saying that it looks like this is coming down the road, whether we like it or not.
There is already a big pay differential between traditional black box testers and testers who write code as part of their job. At Florida Tech, I see a big gap in starting salaries of our students. Students who go into traditional black box roles get much lower offers. As pieceworking gets more popular, I think the gap will grow wider. I think that students who are training today for a job as a traditional black-box tester are training for a type of job that will vanish or will stop paying a middle-class wage for North American workers. I think that introductory courses in software testing have a responsibility to caution students that they need to expand their skills if they want a satisfactory career in testing.
Some people will challenge my analysis. Certainly, I could be wrong. As Rob Lambert pointed out recently, people have been talking about the demise of generalist, black-box testers for years and years and they haven’t gone away yet. And certainly, for years to come, I believe there will be room for a few experts who thrive as consultants and a few senior testers who focus strictly on testing but are high-skill designers/planners. Those roles will exist, but how many?
Given my analysis (which could be wrong), I think it is the responsibility of teachers of introductory testing courses to caution students about the risk of working in traditional testing and the need to develop additional skills that can market well with an interest in testing.
- Combining testing and programming skill is the obvious path and the one that probably opens the broadest set of doors.
- Another path combines testing with deep business knowledge. If you know a lot about actuarial math and about the culture of actuaries, you might be able to provide very valuable services to companies that develop software for actuaries. However, your value for non-actuarial software might be limited.
- For some types of products or services, you might add unusual value if you have expertise in human factors, or in accessibility, or in statistical process control, or in physics. Again, you offer unusual value for companies that need your combination of skills but perhaps only generic value for companies that don’t need your specific skills.
- I don’t think that a quick, informed-amateur-level survey of popular topics in cognitive psychology will provide the kind of skills or the depth of skills that I have in mind.
It’s up to the student to figure out their career path, but it’s up to us to tell them that a career that has been solid for 50 years is changing character in ways they have to prepare for.
I’ve been trying to figure out how to focus a Lesson on this, or how to insert this into one of the other Lessons. I don’t think it will fit. There aren’t enough available hours in a 4-week online course. However, just as we gave application videos to students in the Domain Testing course (and some students watched only one video per Lesson while others watched several), I think I can create a watch-this-if-you-want-to video on career path that accompanies the lecture that addresses test automation.
3. Coverage is a multidimensional concept
My primary goal in teaching “coverage” was to open students’ eyes to the many different ways that they can measure coverage. Yes, there are structural coverage measures (statement coverage, branch coverage, subpath coverage, multicondition coverage, etc.) but there are also many black-box measures. For example, if you have a requirements specification, what percentage of the requirements items have you tested the program against?
Coverage is a measure, normally expressed as a percentage. There is a set of things that you could test: all the lines of code, or all the relevant flavors of smartphones’ operating systems, or all the visible features, or all the published claims about the program, etc. Pick your set. This is your population of possible test targets of a certain kind (target: thing you can test). The size of your set (the number of possible test targets) is your denominator. Count how many you’ve actually tested: that’s your numerator. Do the division and you get the proportion of possible tests of this kind that you have actually run. That’s your coverage measure. Statement coverage is the percentage of statements you’ve tested. Branch coverage is the percentage of branches you’ve tested. Visible-feature coverage is the percentage of visible features that you’ve tested thoroughly enough to say, “I’ve tested that one.”
Complete coverage is 100% coverage of all of the possible tests. The program is completely tested if there cannot be any undiscovered bugs (because you’ve run all the possible tests). For nontrivial programs, complete testing is impossible because the population of possible tests is infinite. So you can’t have complete coverage. You can only have partial coverage. And at that point, we teach you to subdivide the world into types of test targets and to think in terms of many test coverages. My main paper on testing coverage lists 101 different types of coverage. There are many more. Maybe you want to achieve 95% statement coverage and 2% mouse coverage (how many different rodents do you need to test?) and 100% visible feature coverage and so on. Evaluating the relative significance of the different types of coverage gives you a way to organize and prioritize your testing.
I had generally positive results teaching this in Foundations 1.0 and 2.0, but it was clear that some students didn’t understand programming concepts well enough to understand what the structural coverage measures actually were or how they might go about collecting the data. Around 2008, I realized that many of my university students (computer science majors) had this problem too. They had studied control structures, of course, but not in a way that they could easily apply to coverage measurement. In addition, they (and my non-programmer practitioner students) had repeatedly shown confusion about how computers represent data, how rounding error comes about in floating point calculations, why rounding error is inherent in floating point calculations, etc. These confusions had several impacts. For example some students couldn’t fathom overflows (including buffer overflows). Many students insisted that normal floating-point rounding errors were bugs. I had seen mistakes like these in real-life bug reports that damaged the credibility of the bug reporter. So, we decided that for Foundations 3.0, we would add basic information about programming and data representation to the course.
Critique:
- The programming material was very helpful for some students. For university students, it built a bridge between their programming knowledge and their testing lessons. Confusions that I had seen in Foundations 2.0 simply went away. For non-programmer professional students, the results were more mixed. Some students gained a new dimension of knowledge. Other students learned the material well enough to be tested on it but considered it irrelevant and probably never used it. And some students learned very little from this material, whined relentlessly and were irate that they had to learn about code in a course on black box testing.
- The programming material overshadowed other material in Lessons 4 and 5. Some students paid so much attention to this material that they learned it adequately, but missed just about everything else. Some students dropped the course out of fear of the material or because they couldn’t find enough time to work through this material at their pace.
- Many students never understood that coverage is a measurement. Instead, they would insist that “statement coverage” means all statements have been tested, and generally, that X coverage means that all X’s have been tested. I think we could have done a much better job with this if we had more time.
- The lecture also addressed the risk of driving testing to achieve high coverage of a specific kind. I taught Marick’s How to Misuse Code Coverage, which describes the measurement dysfunction that he saw in organizations that pushed their testing to achieve very high statement/branch coverage. The students handled this material well on exams. The programming instruction probably helped with this.
Tentative decisions:
- Return to a lecture that emphasizes the measurement aspects of coverage, the multidimensional nature of coverage, and the risks of driving testing to achieve high coverage (rather than high information).
- Reduce the programming instruction to an absolute minimum. We need to teach a little bit about control structures in order to explain structural coverage, but we’ll do it briefly. This will create the same problems for students as we had in Foundations 1.0 and 2.0.
- Create a series of supplementary videos that describe data representation and control structures. I’ll probably use the same lecture material in my Introduction to Programming in Java (Computer Science 1001). We’ll encourage students to watch the videos, probably offer a quiz on the material to help them learn it, maybe offer a bonus question on the exam, but won’t require the students to look at it.
4. Complete testing is impossible
This lecture teaches students that it is impossible to completely test a program. In the course of this instruction, we teach some basic combinatorics (showing students how to calculation the number of possible tests of certain kinds) and expose them again, in a new way, to the wide variety of types of test targets (things you can test) and the difficulty of testing all of them. We teach the basics of data flows, the basics of path testing, and provide a real-life example of the kinds of bugs you can miss if you don’t do extensive path testing.
Critique:
- Overall, I think this material was OK in Foundations 3.0.
- The lecture introduces a strong example of a real-life application of high-volume automated testing but provides only the briefest introduction to the concept of high-volume automated testing. The study guide (list of potential exam questions) included questions on high-volume automated testing but we were reluctant to ask any of them because we treated the material too lightly.
Tentative decisions:
- Cover essentially the same material.
- Integrate more tightly with the preceding material on the many types of coverage.
- Add a little more on high-volume automated testing.
5. Measurement is important, but difficult
The lecture introduces students to the basics of measurement theory, to the critical importance of construct validity, to the prevalent use of surrogate measures and the difficulties associated with this, and to measurement dysfunction and distortion. We illustrate measurement dysfunction with bug-count metrics.
Critique:
- Overall, the section was reasonably effective in conveying what it was designed to convey, but that design was flawed.
- The lecture exuded mistrust of bad metrics. It provided no positive inspiration for doing good measurement and very little constructive guidance. Back in 2001, Pettichord and Bach and I wrote, “Metrics that are not valid are dangerous.” We inscribed the words in Lessons Learned in Software Testing and in the statement of the Context-Driven Testing Principles.
- Back in the 1990’s (and still in 2001), we didn’t have much respect for the metrics work that had been done and we thought that bright people could easily make a lot of progress toward developing valid, useful metrics. We were wrong. The next dozen years taught me some humility. The context-driven testing community (and the broader software engineering community) made remarkably little progress on improving software-related measurement. Some of us (including me) have made some unsuccessful efforts. I don’t see any hint of a breakthrough on the horizon. Rather, I see more consultants who simply discourage clients and students from taking measurements or who present oversimplified descriptions of qualitative measurement that don’t convey the difficulties and the cost of doing qualitative measurement well. This stuff plays well at conferences but it’s not for me.
- I am against bad measurement (and a lot of terrible measurement is being advocated in our field, and students need some help recognizing it). But I am not against measurement. Back in the 1990s and early 2000s, I discouraged people from taking measurement, because I thought a better set of metrics was around the corner. It’s not. Eventually, my very wise friend, Hung Quoc Nguyen counseled me that it was time for me to rethink my position. He argued that the need for data to guide management is still with us — that if we don’t have better measures available, we need to find better ways to work with what we’ve got. I have come to agree with his analysis.
- Many people in our field are required to provide metrics to their management. We can help them improve what they provide. We can help them recognize problems and suggest better alternatives. But we aren’t helping them if we say that providing metrics is unethical and they need to say refuse to do it, even if that costs them their jobs. I wrote about this in Contexts differ: Recognizing the difference between wrong and Wrong and then in Metrics, Ethics, & Context-Driven Testing (Part 2).
- Becky Fiedler and I have used qualitative approaches several times. We’re enthusiastic about them. See our paper on Putting the Context in Context-Driven Testing, for example. But we have no interest in misrepresenting the time and skill required for this kind of work or the limitations of it. This is a useful set of tools, but it is not a collection of silver bullets.
- Back in the 1990’s (and still in 2001), we didn’t have much respect for the metrics work that had been done and we thought that bright people could easily make a lot of progress toward developing valid, useful metrics. We were wrong. The next dozen years taught me some humility. The context-driven testing community (and the broader software engineering community) made remarkably little progress on improving software-related measurement. Some of us (including me) have made some unsuccessful efforts. I don’t see any hint of a breakthrough on the horizon. Rather, I see more consultants who simply discourage clients and students from taking measurements or who present oversimplified descriptions of qualitative measurement that don’t convey the difficulties and the cost of doing qualitative measurement well. This stuff plays well at conferences but it’s not for me.
Tentative decisions:
- We have to fundamentally rework this material. I will probably start from two places:
- In 2012, Nawwar Kabbani and I presented a long talk at CAST 2012 on Software Metrics: Threats to Validity.
- Becky and I also developed some nice tutorial material for my software metrics class at Florida Tech: On the Quality of Qualitative Measures.
- I think these provide a better starting point for a presentation on metrics that:
- Introduces basic measurement theory
- Alerts people to the risks associated with surrogate measures, weak validity, and abusive use of metrics
- Explains measurement distortion and dysfunction
- Introduces basic qualitative measurement concepts
- Lays out the difficulties of doing good measurement, peeks at the state of metrics in some other fields (not better than ours), and presents a more positive view on ways to gain useful information from imperfect statistics.
- The big challenge here, the enormous challenge, will be fitting this much content into a reasonably short lecture.