On the Quality of Qualitative Measures
Cem Kaner, J.D., Ph.D. & Rebecca L. Fiedler, M.B.A., Ph.D.
This is an informal first draft of an article that will summarize some of the common guidance on the quality of qualitative measures.
- The immediate application of this article is to Kaner’s courses on software metrics and software requirements analysis. Students would be well-advised to read this summary of the lectures carefully (yes, this stuff is probably on the exam).
- The broader application is to the increasingly large group of software development practitioners who are considering using qualitative measures as a replacement for many of the traditional software metrics. For example, we see a lot of attention to qualitative methods in the agenda of the 2014 Conference of the Association for Software Testing. We won’t be able to make it to CAST this year, but perhaps these notes will provide some additional considerations for their discussions.
Managers have common and legitimate informational needs that skilled measurement can help with. They need information in order to (for example…)
- Compare staff
- Compare project teams
- Calculate actual costs
- Compare costs across projects or teams
- Estimate future costs
- Assess and compare quality across projects and teams
- Compare processes
- Identify patterns across projects and trends over time
Executives need these types of information, whether we know how to provide them or not.
Unfortunately, there are strong reasons to be concerned about the use of traditional metrics to answer questions like these. These are human performance measures. As such, they must be used with care or they will cause dysfunction (Austin, 1996). That has been a serious real-life problem (e.g. Hoffman, 2000). The empirical basis supporting several of them has been substantially exaggerated (Bossavit, 2014). Many of the managers who use them know so little about mathematics that they don’t understand what their measurements mean; in such hands, the metrics’ primary uses are to placate management or to intimidate staff. Many of the consultants who give talks and courses advocating metrics also seem to know little about mathematics or about measurement theory. They seem unable to distinguish strong questions from weak ones and unable to discuss the underlying validity of the measures they advocate, so they fall back on appeals to authority, on the intimidating quality of published equations, and on dismissing the critic as a nitpicker or an apologist for undisciplined practices.
In sum, there are problems with the application of traditional metrics in our field. It is no surprise that people are looking for alternatives.
In a history of psychological research methods, Kurt Danziger (1994) discusses the distorting impact of quantification on psychological measurement. (See especially his Chapter 9, From quantification to methodolatry.) Researchers designed experiments that looked more narrowly at human behavior, ignoring (designing out of the research) those aspects of behavior or experience that they could not readily quantify and interpret in terms of statistical models.
“All quantitative data is based upon qualitative judgments.”
(Trochim, 2006 at http://www.socialresearchmethods.net/kb/datatype.php)
Qualitative methods might sometimes provide a richer description of a project or product that is less misleading, easier to understand, and more effective as a source of insight. However, there are problems with the application of qualitative approaches.
- Qualitative reports are, at their core, subjective.
- They are subject to bias at every level (how the data are gathered or selected, stored, analyzed, interpreted and reported). This is a challenge for every qualitative researcher, but it is especially significant in the hands of an untrained researcher.
- They are based on selected data.
- They aren’t very helpful for making comparisons or for providing quantitative estimates (like, how much will this cost?).
- They are every bit as open to abuse as quantitative methods.
- And it costs a lot of effort to do qualitative measurement well.
We are fans of measurement (qualitative or quantitative) when it is done well and we are unenthusiastic about measurement (qualitative or quantitative) when it is done badly or sold overenthusiastically to people who aren’t likely to understand what they’re buying.
Because this paper won’t propagandize qualitative measurement as the unquestioned embodiment of sweetness and light, some readers might misunderstand where we are coming from. So here is a little about our background.
- As an undergraduate, Kaner studied mainly mathematics and philosophy. He also took two semesters of coursework with Kurt Danziger. We only recently read Danziger (1994) and realized how profoundly Danziger has influenced Kaner’s professional development and perspective. As a doctoral student in experimental psychology, Kaner did some essentially-qualitative research (Kaner et al., 1978) but most of his work was intensely statistical, applying measurement theory to human perception and performance (e.g. Kaner, 1983). He applied qualitative methods to client problems as a consultant in the 1990’s. His main stream of published critiques of traditional quantitative approaches started in 1999 (Kaner, 1999a, 1999b). He wrote explicitly about the field’s need to use qualitative measures in 2002. He started giving talks titled “Software Testing as a Social Science” in 2004, explicitly identifying most software engineering measures as human performance measures subject to the same types of challenges as we see in applied measurement in psychology and in organizational management.
- Fiedler’s (2006, 2007) dissertation used Cultural-Historical Activity Theory (CHAT), a framework for analyzing and organizing qualitative investigations, to examine portfolio management software in universities. Kaner & Fiedler started applying CHAT to scenario test design in 2007. We presented a qualitative-methods tutorial and a long paper with detailed pointers to the literature at CAST in 2009 and at STPCon in Spring 2013. We continue to use and teach these ideas and have been working for years on a book relating qualitative methods to the design of scenario tests.
We aren’t new to qualitative methods. This is not a shiny new fad for us. We are enthusiastic about increasing the visibility and use of these methods but we are keenly aware of the risk of over-promoting a new concept to the mainstream in ways that dilute the hard parts until all that remains are buzzwords and rituals. (For us, the analogies are Total Quality Management, Six Sigma, and Agile Development.)
Perhaps some notes on what makes qualitative measures “good” (and what doesn’t) might help slow that tide.
No, This is Not Qualitative
Maybe you have heard a recommendation to make project status reporting more qualitative. To do this, you create a dashboard with labels and faces. The labels identify an area or issue of concern, such as how buggy the software is. Instead of numbers, you use colored faces, because faces are supposedly more meaningful. A red frowny-face says, There is trouble here. A yellow neutral-face says, Things seem OK, nothing particularly good or bad to report now. And a green smiley-face says, Things are going well. You could add more differentiation by having a brighter red with a crying-face or a screaming-or-cursing-face and by having a brighter green with a happy-laughing face.
See, there are no numbers on this dashboard, so it is not quantitative, right?
The faces are ordered from bad to good. You can easily assign numerals to them, 1 for red-screaming-face through 5 for green-laughing-face; you can talk about the “average” (median) score across all the categories of information; you can even draw graphs of the change of confidence (or whatever you map to happyfacedness) from week to week across the project.
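To make the point concrete, here is a minimal sketch (the status areas and face labels are invented for illustration) showing how directly such a dashboard converts to numbers:

```python
from statistics import median

# Hypothetical face scale: the labels are ordered from bad to good,
# which is exactly what makes them an ordinal (numeric) scale.
FACE_SCALE = {
    "red-screaming": 1,
    "red-frowny": 2,
    "yellow-neutral": 3,
    "green-smiley": 4,
    "green-laughing": 5,
}

# A made-up dashboard: one face per status area.
dashboard = {
    "bug backlog": "red-frowny",
    "test progress": "yellow-neutral",
    "build stability": "green-smiley",
}

# The "no numbers" dashboard converts directly to numbers...
scores = [FACE_SCALE[face] for face in dashboard.values()]

# ...and supports the usual summary statistics.
print(median(scores))  # prints 3
```

Anything you can feed to median() is already a quantitative measurement, whatever the icons look like.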
This might not be very good quantitative measurement but as qualitative measurement it is even worse. It uses numbers (symbols that are equivalent to 1 to 5) to show status without emphasizing the rich detail that should be available to explain and interpret the situation.
When you watch a consultant present this as qualitative reporting, send him away. Tell him not to come back until he actually learns something about qualitative measures.
OK, So What is Qualitative?
A qualitative description of a product or process is a detail-rich, multidimensional story (or collection of stories) about it. (Creswell, 2012; Denzin & Lincoln 2011; Patton, 2001).
For example, if you are describing the value of a product, you might present examples of cases in which it has been valuable to someone. The example wouldn’t simply say, “She found it valuable.” The example would include a description of what made it valuable, perhaps how the person used it, what she replaced with it, what made this one better than the last one, and what she actually accomplished with it. Other examples might cover different uses. Some examples might be of cases in which the product was not useful, with details about that. Taken together, the examples create an overall impression of a pattern – not just the bias of the data collector spinning the tale he or she wants to tell. For example, the pattern might be that most people who try to do THIS with the product are successful and happy with it, but most people who try to do THAT with it are not, and many people who try to use this tool after experience with this other one are likely to be confused in the following ways …
When you describe qualitatively, you are describing your perceptions, your conclusions, and your analysis. You back it up with examples that you choose, quotes that you choose, and data that you choose. Your work should be meticulously and systematically even-handed. This work is very time-consuming.
Quantitative work is typically easier, less ambiguous, requires less-detailed knowledge of the product or project as a whole, and is therefore faster.
If you think your qualitative measurement methods are easier, faster and cheaper than the quantitative alternatives, you are probably not doing the qualitative work very well.
Quality of Qualitative
In quantitative measurement, questions about the value of a measure boil down to questions of validity and reliability.
A measurement is valid to the extent that it provides a trustworthy description of the attribute being measured. (Shadish, Cook & Campbell, 2001)
A measurement is reliable to the extent that repeating the same operations (measuring the same thing in the same ways) yields the same (or similar) results.
In qualitative work, the closest concept corresponding to validity is credibility (Guba & Lincoln, 1989). The essential question about the credibility of a report of yours is, Why should someone else trust your work? Here are examples of some of the types of considerations associated with credibility.
Examples of Credibility-Related Considerations
The first and most obvious consideration is whether you have the background (knowledge and skill) to be able to collect, interpret and explain this type of data.
Beyond that, several issues come up frequently in published discussions of credibility. (Our presentation is based primarily on Agostinho, 2005; Creswell, 2012; Erlandson et al., 1993; Finlay, 2006; and Guba & Lincoln, 1989.)
- Did you collect the data in a reasonable way?
- How much detail?: Students of ours work with qualitative document analysis tools, such as ATLAS.ti, Dedoose, and NVivo. These tools let you store large collections of documents (such as articles, slides, and interview transcripts), pictures, web pages, and videos (https://en.wikipedia.org/wiki/Computer_Assisted_Qualitative_Data_Analysis_Software). We are now teaching scenario testers to use the same types of tools. If you haven’t worked with one of these, imagine a concept-mapping tool that allows you to save all the relevant documents as sub-documents in the same document as the map and allows you to show the relationships among them not just with a two-dimensional concept map but with a multidimensional network, a set of linkages from any place in any document to any place in any other document.
As you see relevant information in a source item, you can code it. Coding means applying meaningful tags to the item, so that you can see later what you were thinking now. For example, you might code parts of several documents as illustrating high or low productivity on a project. You can also add comments to these examples, explaining for later review what you think is noteworthy about them. You might also add a separate memo that describes your ideas about what factors are involved in productivity on this project, and another memo that discusses a different issue, such as notes on individual differences in productivity that seem to be confounding your evaluation of tool-caused differences. Later, you can review the materials by looking at all the notes you’ve made on productivity—all the annotated sources and all your comments.
You have to find (or create) the source materials. For example, you might include all the specification-related documents associated with a product, all the test documentation, user manuals from each of your competitors, all of the bug reports on your product and whatever customer reports you can capture for other products, interviews with current users, including interviews with extremely satisfied users, users who abandoned the product and users who still work with the product but hate it. Toss in status reports, comments in the source code repository, emails, marketing blurbs, and screen shots. All these types of things are source materials for a qualitative project.
You have to read and code the material. Often, you read and code with limited or unsophisticated understanding at first. Your analytical process (and your ongoing experience with the product) gives you more insight, which causes you to reread and recode material. The researcher typically works through this type of material in several passes, revising the coding structure and adding new types of comments (Patton, 2001). New information and insights can cause you to revise your analysis and change your conclusions.
The final report gives a detailed summary of the results of this analysis.
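As a rough illustration of what coding means mechanically (the class, code names, and snippets below are invented; real tools such as ATLAS.ti, Dedoose, and NVivo do far more), you can think of a coded corpus as a tag-to-passages index:

```python
from collections import defaultdict

class CodedCorpus:
    """A toy stand-in for a qualitative analysis tool's coding feature."""

    def __init__(self):
        # Maps each code (tag) to the passages filed under it.
        self._by_code = defaultdict(list)

    def code(self, source, passage, codes, comment=""):
        """Tag a passage from a source document with one or more codes,
        optionally attaching a comment explaining what you were thinking."""
        for c in codes:
            self._by_code[c].append((source, passage, comment))

    def review(self, code):
        """Retrieve every annotated passage filed under a code."""
        return self._by_code[code]

corpus = CodedCorpus()
corpus.code("status-report-12", "Team closed 40 bugs this week",
            ["high-productivity"], comment="New build tool in use")
corpus.code("interview-ann", "I spend half my day fighting the merge queue",
            ["low-productivity", "tooling"], comment="Possible confound")

# Later, review everything you coded as bearing on low productivity.
for source, passage, note in corpus.review("low-productivity"):
    print(source, "->", passage)
```

In this picture, recoding in later passes is just re-tagging passages as your understanding of the codes themselves improves.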
- Prolonged engagement: Did you spend enough time at the site of inquiry to learn the culture, to “overcome the effects of misinformation, distortion, or presented ‘fronts’, to establish rapport and build the trust necessary to overcome constructions, and to facilitate immersing oneself in and understanding the context’s culture”?
- Persistent observation: Did you observe enough to focus on the key elements and to add depth? The distinction between prolonged engagement and persistent observation is the difference between having enough time to make the observations and using that time well.
- Triangulation and convergence. “Triangulation leads to credibility by using different or multiple sources of data (time, space, person), methods (observations, interviews, videotapes, photographs, documents), investigators (single or multiple), or theory (single versus multiple perspectives of analysis).” (Erlandson et al. 1993, pp. 137-138). “The degree of convergence attained through triangulation suggests a standard for evaluating naturalistic studies. In other words, the greater the convergence attained through the triangulation of multiple data sources, methods, investigators, or theories, the greater the confidence in the observed findings. The convergence attained in this manner, however, never results in data reduction but in an expansion of meaning through overlapping, compatible constructions emanating from different vantage points.” (Erlandson et al. 1993, p. 139).
- Are you summarizing the data fairly?
- How are you managing your biases (people are often not conscious of the effects of their biases) as you select and organize your observations?
- Are you prone to wishful thinking or to trying to please (or displease) people in power?
- Peer debriefing: Did you discuss your ideas with one or more disinterested peers who gave constructively critical feedback and questioned your ideas, methods, motivation, and conclusions?
- Disconfirming case analysis: Did you look for counter-examples? Did you revise your working hypotheses in light of experiences that were inconsistent with them?
- Progressive subjectivity: As you observed situations or created and looked for data to assess models, how much did you pay attention to your own expectations? How much did you consider the expectations and observations of others? An observer who affords too much privilege to his or her own ideas is not paying attention.
- Member checks: If you observed / measured / evaluated others, how much did you involve them in the process? How much influence did they have over the structures you would use to interpret the data (what you saw or heard or read) that you got from them? Do they believe you accurately and honestly represented their views and their experiences? Did you ask?
The concerns that underlie transferability are much the same as for external validity (or generalization validity) of traditional metrics:
- If you or someone else did a comparable study in a different setting, how likely is it that you would make the same observations (see similar situations, tradeoffs, examples, etc.)?
- How well would your conclusions apply in a different setting?
When evaluating a research report, thorough description is often a key element. The reader doesn’t know what will happen when someone tries a similar study in the future, so they (and you) probably cannot authoritatively predict generalizability (whether people in other settings will see the same things). However, if you describe what you saw well enough, in enough detail and with enough attention to the context, then when someone does perform a potentially-comparable study somewhere else, they will probably be able to recognize whether they are seeing things that are similar to what you were seeing.
Over time, a sense of how general something is can build as multiple similar observations are recorded in different settings.
The concerns that underlie dependability are similar to those for internal validity of traditional metrics. The core question is whether your work is methodologically sound.
Qualitative work is more exploratory than quantitative (at least, more exploratory than quantitative work is traditionally described). You change what you do as you learn more or as you develop new questions. Therefore consistency of methodology is not an ultimate criterion in qualitative work, as it is for some quantitative work.
However, a reviewer can still ask how well (methodologically) you do your work. For example:
- Do you have the necessary skills and are you applying them?
- If you lack skills, are you getting help?
- Do you keep track of what you’re doing and make your methodological changes deliberately and thoughtfully? Do you use a rigorous and systematic approach?
Many of the same ideas that we mentioned under credibility apply here too, such as prolonged engagement, persistent observation, effort to triangulate, disconfirming case analysis, peer debriefing, member checks and progressive subjectivity. These all describe how you do your work.
- As issues of credibility, we are asking whether you and your work are worth paying attention to. Your attention to methodology and fairness reflects on your character and trustworthiness.
- As issues of methodology, we are asking more about your skill than about your heart.
Confirmability is as close to reliability as qualitative methods get, but the qualitative approach does not rest as firmly on reliability. The quantitative measurement model is mechanistic. It assumes that under reasonably similar conditions, the same acts will yield the same results. Qualitative researchers are more willing to accept the idea that, given what they know (and don’t know) about the dynamics of what they are studying, under seemingly-similar circumstances, the same things might not happen next time.
We assess reliability by taking repeated measurements (do similar things and see what happens). We might assess confirmability as the ability to be confirmed rather than whether the observations were actually confirmed. From that perspective, if someone else works through your data:
- Would they see the same things as you?
- Would they generally agree that things you see as representative are representative and things that you see as idiosyncratic are idiosyncratic?
- Would they be able to follow your analysis, find your records, understand your ways of classifying things and agree that you applied what you said you applied?
- Does your report give your reader enough raw data for them to get a feeling for the confirmability of your work?
Qualitative measurements tell a story (or a bunch of stories). The skilled qualitative researcher relies on transparency in methods and data to tell persuasive stories. Telling stories that can stand up to scrutiny over time takes enormous work. This work can have great value, but to do it, you have to find time, gain skill, and master some enabling technology. Shortchanging any of these areas can put your credibility at risk as decision-makers rely on your stories to make important decisions.
S. Agostinho (2005, March). “Naturalistic inquiry in e-learning research”. International Journal of Qualitative Methods, 4 (1).
R. D. Austin (1996). Measuring and Managing Performance in Organizations. Dorset House.
L. Bossavit (2014). The Leprechauns of Software Engineering: How folklore turns into fact and what to do about it. Leanpub.
J. Creswell (2012, 3rd ed.). Qualitative Inquiry and Research Design: Choosing Among Five Approaches. Sage Publications.
K. Danziger (1994). Constructing the Subject: Historical Origins of Psychological Research. Cambridge University Press.
N.K. Denzin & Y.S. Lincoln (2011, 4th ed.) The SAGE Handbook of Qualitative Research. Sage Publications.
D.A. Erlandson, E.L. Harris, B.L. Skipper & S.D. Allen (1993). Doing Naturalistic Inquiry: A Guide to Methods. Sage Publications.
R. L. Fiedler (2006). “In transition”: An activity theoretical analysis examining electronic portfolio tools’ mediation of the preservice teacher’s authoring experience. Unpublished Ph.D. dissertation, University of Central Florida (Publication No. AAT 3212505).
R.L. Fiedler (2007). “Portfolio authorship as a networked activity”. Paper presented at the Society for Information Technology and Teacher Education.
R.L. Fiedler & C. Kaner (2009). “Putting the context in context-driven testing (an application of Cultural Historical Activity Theory)”. Conference of the Association for Software Testing. Colorado Springs, CO.
L. Finlay (2006). “‘Rigour’, ‘ethical integrity’ or ‘artistry’? Reflexively reviewing criteria for evaluating qualitative research.” British Journal of Occupational Therapy, 69 (7), 319-326.
E.G. Guba & Y.S. Lincoln (1989). Fourth Generation Evaluation. Sage Publications.
D. Hoffman (2000). “The darker side of metrics,” presented at Pacific Northwest Software Quality Conference, Portland, OR.
C. Kaner (1983). Auditory and visual synchronization performance over long and short intervals, Doctoral Dissertation: McMaster University.
C. Kaner (1999a). “Don’t use bug counts to measure testers.” Software Testing & Quality Engineering, May/June, 1999, p. 80.
C. Kaner (1999b). “Yes, but what are we measuring?” (Invited address) Pacific Northwest Software Quality Conference, Portland, OR.
C. Kaner (2002). “Measuring the effectiveness of software testers.” 15th International Software Quality Conference (Quality Week), San Francisco, CA.
C. Kaner (2004). “Software testing as a social science.” IFIP Working Group 10.4 meeting on Software Dependability, Siena, Italy.
C. Kaner & R. L. Fiedler (2013). “Qualitative Methods for Test Design“. Software Test Professionals Conference (STPCon), San Diego, CA.
C. Kaner, B. Osborne, H. Anchel, M. Hammer & A.H. Black (1978). “How do fornix-fimbria lesions affect one-way active avoidance behavior?” 86th Annual Convention of the American Psychological Association, Toronto, Canada.
M.Q. Patton (2001, 3rd ed.). Qualitative Research & Evaluation Methods. Sage Publications.
W.R. Shadish, T.D. Cook & D.T. Campbell (2001, 2nd ed.). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Cengage.
W.M.K. Trochim (2006). Research Methods Knowledge Base.