The “Failure” of Udacity

November 23rd, 2013

If you are not aware of it, Udacity is a huge provider of a type of online course called a MOOC (Massive Open Online Course). Recently, a founder of Udacity announced that he was disappointed in Udacity’s educational results and was shifting gears from general education to corporate training.

I was brought into some discussions of this among academics and students. A friend suggested that I slightly revise one of my emails for general readership on my blog. So here goes.

My note is specifically a reaction to two articles:

Udacity offers free or cheap courses. My understanding is that it has a completion rate of 10% (only 10% of the students who start, finish) and a pass rate of 5%. This is not a surprising number. Before there were MOOCs, numbers like this were reported for other types of online education in which students set their own pace or worked with little direct interaction with the instructor. For example, I heard that Open University (a school for which I have a lot of respect) had numbers like this.

I am not sure that 10% (or 5%) is a bad rate. If the result is that thousands of people get opportunities that they would otherwise not have had, that’s an important benefit—even if only 5% find the time to make full use of those opportunities.

In general, I’m a fan of open education. When I interviewed for a professorship at Florida Tech in 1999, I presented my goal of creating open courseware for software testing (and software engineering education generally). NSF funded this in 2001. The result has been the BBST course series, used around the world in commercial and academic courses.

Software testing is a great example of the need for courses and courseware that don’t fit within the traditional university stream. I don’t believe that we will see good undergraduate degree programs in software testing. Instead, advanced testing-specific education will come from training companies and professional societies, perhaps under the supervision/guidance of some nonprofits formed for this purpose, either in universities (like my group, the Center for Software Testing Education & Research) or in the commercial space (like ISTQB). As I wrote in a recent post, I believe we have to develop a better credentialing system for software testing.

We are going to talk about this in the Workshop on Teaching Software Testing (WTST 13, January 24-26, 2014). The workshop is focused on Teaching Advanced Courses in Software Testing. It seems clear from preparatory discussions that this topic will be a springboard for discussions of advanced credentials.

Back to the MOOCs.

Udacity (and others) have earned some ill-will in the instructional community. There have been several types of irritants, such as:

  • Some advocates of MOOCs have pushed the idea that MOOCs will eliminate most teaching positions. After all, if you can get a course from one of the world’s best teachers, why settle for second best? The problem with this is that it assumes that teaching = lectures. For most students, this is not true. Students learn by doing things and getting feedback. By writing essays and getting feedback. By writing code and getting feedback. By designing tests and getting feedback. The student activities—running them, coaching students through them, critiquing student work, suggesting follow-up activities for individuals to try next—do not easily scale. I spent about 15 hours this week in face-to-face meetings with individual students, coaching them on statistical analysis or software testing. Next week I will spend about 15 hours in face-to-face meetings with local students or Skype sessions with online students. This is hard work for me, but my students tell me they learn a lot from this. When people dismiss the enormous amount of work that good teachers put into creating and supporting feedback loops for their students—especially when those people stand to make money by convincing customers and investors that this work is irrelevant—those teachers sometimes get annoyed.
  • Some advocates of MOOCs, and several politicians and news columnists, have pushed the idea that this type of education can replace university education. After all, if you can educate a million students at the same time (with one of the world’s best teachers, no less), why bother going to a brick-and-mortar institution? It is this argument that fails when 95% of the students flunk out or drop out. But I think it fails worse when you consider what these students are learning. How hard are the tests they are taking or the assignments they are submitting? How carefully graded is the work—not just how accurate is the grading, though that can certainly be a big issue with computerized grading—but also, how informative is the feedback from grading? Students pay attention to what you tell them about their work. They learn a lot from that, if you give them something to learn from. My impression is that many of the tests/exams are superficial and that much of the feedback is limited and mechanical. When university teachers give this quality of feedback, students complain. They know they should get better than that at school.
  • Proponents of MOOCs typically ignore or dismiss the social nature of education. Students learn a lot from each other. Back when I paid attention to the instructional-research literature, I used to read studies that reported graduating students saying they learned more from each other than from the professors. There are discussion forums in many (most? all?) MOOCs, but from what I’ve seen and been told by others, these are rarely or never well moderated. A skilled instructor keeps forum discussions on track, moves off-topic posts to another forum, asks questions, challenges weak answers, suggests readings and follow-up activities. I haven’t seen or heard of that in the MOOCs.

As far as I can tell, in the typical MOOC course, students get lectures that may have been fantastically expensive to create, but they get little engagement in the course beyond the lectures. They are getting essentially-unsupervised online instruction. And that “instruction” seems to be a technologically-fancier way of reading a book. A fixed set of material flows from the source (the book or the video lecture) to the student. There are cheaper, simpler, and faster ways to read a book.

My original vision for the BBST series was much like this. But by 2006, I had abandoned the idea of essentially-unsupervised online instruction and started working on the next generation of BBST, which would require much more teacher-with-student and student-with-student engagement.

There has been relentless (and well-funded) hype and political pressure to drive universities to offer credit for courses completed on Udacity and platforms like it. Some schools have succumbed to the pressure.

The political pressure on universities that arises from this model has been to push us to lower standards:

  • lower standards of interaction (students can be nameless cattle herded into courses where no one will pay attention to them)
  • lower standards of knowledge expectation (trivial, superficial machine grading of the kind that can scale to a mass audience)
  • lower standards of instructional design (good design starts from considering what students should learn and how to shepherd them through experiences that will help them achieve those learning objectives. Lecture plans are not instructional design, even if the lectures are well-funded, entertaining and glitzy.)

Online instruction doesn’t have to be simplistic, but when all that the public sees in the press is well-funded hype that pushes technoglitz over instructional quality, people compare what they see with what is repeated uncritically as if it were news.

The face-to-face model of instruction doesn’t scale well enough to meet America’s (or the world’s) socioeconomic needs. We need new models. I believe that online instruction has the potential to be the platform on which we can develop the new models. But the commoditizing of the instructor and the cattle-herding of the students that have been offered by the likes of Udacity are almost certainly not the answer.

Quality – which I measure by how much students learn – costs money. Personal interaction between students and instructors, and significant assignments that get careful grading and detailed feedback, cost money. It is easy to hire cheap assistants or unqualified adjuncts, but it takes more than a warm body to provide high-quality feedback. (There are qualified adjuncts, but the law of supply and demand has an effect when adjunct pay is low.)

The real cost of education is not the money. Yes, that is hugely significant. But it is as nothing compared to the years of life that students sacrifice to get an education. The cost of time wasted is irrecoverable.

In the academic world, there are some excellent online courses and there has been a lot of research on instructional effectiveness in these courses. Many online courses are more effective—students learn more—than face-to-face courses that cover the same material. But these are also more intense, for the teacher and the students. The students, and their teachers, work harder.

Becky Fiedler and I formed Kaner Fiedler Associates to support the next generation of BBST courses. We started the BBST effort with a MOOC-like vision of a structure that offers something for almost nothing. Our understanding evolved as we created generations of open courseware.

I think we can create high-quality online education that costs less than traditional schooling. I think we can improve the ways institutions recognize students’ preexisting knowledge, reducing the cost (but not the quality) of credentials. But reducing costs and improving value do not mean “free” or even “cheap.” The price has to be high enough to sustain the course development, the course maintenance, and the costs of training, providing and supervising good instructors. There is, as far as we can tell, no good substitute for this.

New Book: The Domain Testing Workbook

November 16th, 2013

Sowmya Padmanabhan, Doug Hoffman and I just published a new book together, The Domain Testing Workbook.

The book focuses on one (1) testing technique. Domain Testing is the name of a generalized approach to Equivalence Class Analysis and Boundary Testing. Our goal is to help you develop skill with this one technique. There are plenty of overviews of this technique: it is the most widely taught, and probably the best understood, technique in the field. However, we’re not aware of any other presentations that are focused on helping you go beyond an understanding of the technique to achieving the ability to use it competently.
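To make the technique concrete (this example is mine, not one from the book): for a field that is supposed to accept integers from 1 to 100, a domain analysis partitions the possible inputs into equivalence classes and picks boundary values to represent each class. A minimal sketch in Python, with an illustrative stand-in for the program under test:

```python
# Hypothetical illustration of equivalence class analysis and boundary testing.
# The field, its limits, and the accepts() stand-in are made up for this sketch.

def candidate_tests(lo, hi):
    """Representative values for each equivalence class of an integer field."""
    return {
        "valid (in range)":    [lo, lo + 1, (lo + hi) // 2, hi - 1, hi],  # boundaries plus an interior value
        "invalid (too small)": [lo - 1],
        "invalid (too large)": [hi + 1],
    }

def accepts(value, lo=1, hi=100):
    """Stand-in for the input check of the program under test."""
    return lo <= value <= hi

if __name__ == "__main__":
    for eq_class, values in candidate_tests(1, 100).items():
        expected = eq_class.startswith("valid")
        for v in values:
            verdict = "pass" if accepts(v) == expected else "FAIL"
            print(f"{verdict}  {eq_class:22}  value={v}")
```

The skill the book aims to build is everything this sketch leaves out: deciding what the variables really are, where their boundaries lie, and which combinations and risks are worth testing.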

This is the first of a series of books that are coming along slowly. Probably the next one will be on scenario testing, with Dr. Rebecca Fiedler, merging her deep knowledge of qualitative methodology with my hands-on use of scenarios in software design, software testing, and software-related human factors. Also in the works are books on risk-based testing and specification-based testing. We’ve been working on all of these for years. We learned much from the Domain Testing Workbook about how challenging it is to write a book with a development-of-skill instructional design goal. At this point, we have no idea what the schedule is for the next three. When the soup is ready, we’ll serve it.

This work comes out of a research proposal that I sent to the United States’ National Science Foundation (NSF) in 2001 (Improving the Education of Software Testers), which had, among its objectives:

  • “Identify skills involved in software testing.”
  • “Identify types of exercises that support the development of specific testing skills.”
  • “Create and publish a collection of reusable materials for exercises and tests.”
  • “Prototype a web-based course on software testing.”
  • “Create a series of workshops that focus on the teaching of software testing”

NSF approved the project, which made it possible for us to open the Center for Software Testing Education & Research (CSTER). The web-based course we had in mind is now available as the BBST series. The Workshops on Teaching Software Testing are now in their 13th year. And the Domain Testing Workbook is our primary contribution focused on “exercises that support the development of specific testing skills.”

When we started this project, we knew that domain testing would be the easiest technique to write this kind of book about. Over the years, we had two surprises that caused us to fundamentally rethink the structure of this book (and of other books that I’m still working on that are focused on scenario testing, risk-based testing, and specification-based testing):

  • Our first surprise (or better put, our first shock) was that we could teach students a broad collection of examples of the application of domain testing, confirm that they could apply what they had learned to similar problems, and yet these students would fail miserably when we gave them a slightly more complex problem that was a little different from what they had seen before. This was the core finding of Sowmya’s M.Sc. thesis. What we had tripped over was the transfer problem, which is probably the central instructional-design problem in Science, Technology, Engineering & Mathematics (STEM) instruction. The classic example is the student who does well in a calculus course but cannot figure out how to apply calculus to problems (like modeling acceleration) in an introductory physics course. These students can’t transfer what they learned in one course to another course or to more complex situations (such as using it at their job). We had believed that we could work around the transfer problem by building our instruction around realistic exercises/examples. We were mistaken.
    • Ultimately, we concluded that teaching through exercises is still our best shot at helping people develop skill, but that we needed to provide a conceptual structure for the exercises that could give students a strategy for approaching new problems. We created a schema—an 18-step cognitive structure that describes how we do a domain analysis—and we present every exercise and every example in the context of that schema. We haven’t done, for the schema, the kind of formal experimental research that Sowmya was able to do with our initial approach—an experiment like that is time-consuming and expensive, and our funding for that type of work ran out long ago. However, we have inflicted many drafts of the schema on our students at Florida Tech and we believe it improves their performance and organizes their approach to tasks like exams.
  • Our next surprise was that domain testing is harder to apply than we expected. Doug and I are experienced with this technique. We think we’re pretty good at it, and we’ve thought that for a long time. Many other testers perceive us that way too. For example, Testing Computer Software opens with an example of domain testing and talks about the technique in more detail throughout the book. That presentation has been well received. So, when we decided to write a book with a few dozen examples that increased in complexity, we were confident that we could work through the examples quickly. It might take more time to write our analysis in a way that readers could understand, but doing the analysis would be straightforward. We were sooooooo wrong. Yes, we could quickly get to the core of the problem, sketching how we would approach identifying equivalence classes and boundaries (or non-boundary most-powerful test cases) for each class. Yes, we could quickly list several good tests. We got far enough along that we could impress other people with what we knew and how quickly we could get there, but not far enough to complete the problem. We were often getting through about a third of the analysis before getting stuck. Doug would fly to Palm Bay (Florida, where I live) and we would work problems. Some problems that we expected to take a day to solve and explain took us a week.
    • As we came to understand what skills and knowledge we were actually applying when we slogged through the harder problems, we added more description of our background knowledge to the book. A presentation of our Domain Testing Schema—and the thinking behind it—grew from an expected length of about 30 pages to 200. Our presentation of 30 worked examples grew from an expected length of maybe 120 pages (4 pages each, with diagrams) to 190 pages.

We got a lot of help with the book. Our acknowledgments list 91 people who helped us think through the ideas in the book. Many others helped indirectly, such as many participants in the WTST workshops who taught us critical lessons about the instructional issues we were beating our heads against.

Perhaps the main advance in the clarity of the presentation came out of gentle-but-firm, collegial prodding by Paul Gerrard. Paul’s point was that domain testing is really a way for a tester to model the software. The tests that come out of the analysis are not necessarily tied to the actual code. They are tied to the tester’s mental model of the code. Our first reactions to these comments were that Paul was saying something obvious. But over time we came to understand his point—it might be obvious to us, but presentations of domain testing, including ours, were doing an inadequate job of making it obvious to our readers. This led us to import the idea of a notional variable from financial analysis as a way of describing the type of model that domain testing was leading us to make. We wrote in the book:

“A notional variable reflects the tester’s “notion” of what is in the code. The tester analyzes the program as if this variable was part of the code…. The notional variable is part of a model, created by the tester, to describe how the program behaves or should behave. The model won’t be perfect. That exact variable is probably not in the code. However, the notional variable is probably functionally equivalent to (works the same way as) a variable that is in the code or functionally equivalent to a group of variables that interact to produce the same behavior as this variable.”

This in turn helped us rethink parts of the Schema in ways that improved its clarity. And it helped us clarify our thinking about domain-related exploratory testing as a way of refining the modeling of the notional variables (bringing them into increasingly accurate alignment with the program implementation or, if the implementation is inadequate, with how the program should behave).

I hope you have a chance to read The Domain Testing Workbook and if you do, that you’ll post your reactions here or as a review on Amazon.

The 2014 Workshop on Teaching Software Testing is Expanded

October 28th, 2013

The 2014 Workshop on Teaching Software Testing is scheduled for January 24-26 in Melbourne, FL. This year’s focus will be on advanced courses in software testing. You can read the details at http://wtst.org.

In addition, we will immediately follow WTST with the first live pilot of the Domain Testing class. This will be our first of the next generation of BBST classes. Rebecca Fiedler and I have been working on a new course design and we plan to use some new technology. Our first pilot will run in Melbourne from Jan 27-31, 2014. Our second live pilot is scheduled for May 12-16 at DeveloperTown in Indianapolis.

Are you interested in attending one of the pilot courses? If so, please apply here. If you are accepted as a Domain Testing Workshop participant, there will be a non-refundable deposit of $100 to defray the cost of meals, snacks, and coffee service. We’ll publish more details about the pilot courses as they become available.


Teaching a New Statistics Course: I need your recommendations

June 20th, 2013

Starting this Fall, I will be teaching Applied Statistics to undergraduate students majoring in Computer Science and in Software Engineering. I am designing the course around real-life examples (published papers, conference talks, blogged case studies, etc.) of the use of probability and statistics in our field. I need examples.

I prefer examples that show an appropriate use of a model or a statistic, but I will be glad to also teach from a few papers that use a model or statistic in a clearly invalid way. If you send me a paper, or a link to one, I would appreciate it if you would tell me what you think is good about it (for a stats student to study) and why you think that.

Background of the Students

The typical student in the course has studied discrete math, including some combinatorics, and has at least two semesters of calculus. Only some students will have multivariable calculus.

By this point in their studies, most of the students have taken courses in psychology, logic, programming (up to at least Algorithms & Data Structures), two laboratory courses in experimental sciences (chemistry, physics, or biology), and some humanities courses, including a course on research sources/systems (how to navigate the research literature).

My Approach to the Course

In one semester, applied stats courses often cover the concept of probability, discrete distributions, continuous distributions, descriptive statistics, basic inferential statistics and an introduction to stochastic models. The treatment of these topics is often primarily theoretical, with close attention to the underlying theorems and to derivation of the attributes of the distributions and their key statistics. Here are three examples of frequently-used texts for these courses. I chose these because you can see their table of contents on the amazon site.

I’d rather teach a more applied course. Rather than working through the topics in a clear linear order, I’d like to start with a list of 25 examples (case studies) and teach enough theory and enough computation to understand each case.

For example, consider a technical support department receiving customer calls. How many staff do they need? How long will customers have to wait before their call is answered? To answer questions like these, students would learn about the Erlang distributions. I’d want them to learn some of the underlying mathematics of the distribution, the underlying model (see this description of the model for example) and how this practical model connects to the mathematical model, and gain some experience with the distribution via simulation.
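As a rough sketch of the kind of simulation exercise I have in mind (the arrival rate, service rate, and staff count below are made-up numbers, and the model assumes Poisson arrivals and exponential service times, i.e., an M/M/c queue):

```python
# Simulation sketch for the call-center example: customers arrive as a Poisson
# process and are served by a pool of agents with exponential service times.
# All parameter values are illustrative, not data from a real support center.
import heapq
import random

def simulate_wait_times(arrival_rate, service_rate, servers, n_customers, seed=1):
    """Return the time each simulated customer waits before service starts."""
    random.seed(seed)
    free_at = [0.0] * servers              # time at which each agent next becomes free
    heapq.heapify(free_at)
    clock = 0.0
    waits = []
    for _ in range(n_customers):
        clock += random.expovariate(arrival_rate)     # next arrival
        earliest_free = heapq.heappop(free_at)        # first agent to become available
        start = max(clock, earliest_free)
        waits.append(start - clock)
        heapq.heappush(free_at, start + random.expovariate(service_rate))
    return waits

if __name__ == "__main__":
    waits = simulate_wait_times(arrival_rate=2.0,     # calls per minute
                                service_rate=0.25,    # each call averages 4 minutes
                                servers=10,
                                n_customers=50_000)
    print("mean wait (minutes):", sum(waits) / len(waits))
    print("fraction who wait at all:", sum(w > 0 for w in waits) / len(waits))
```

Students could compare the simulated waiting times against the Erlang C formula and then ask how the answers change as the staffing level changes.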

Examples Wanted

The biggest complaint that students (at Florida Tech and at several other schools, long ago) have voiced to me about statistics is that they don’t understand how the math applies to anything that they care about now or will ever care about. They don’t see the application to their field or to their personal work. Because of that, I have a strong preference for examples from Computer Science / Software Engineering / Information Security / Computer Information Systems.

Several of my students will also relate well to biostatistics examples. Some will work well with quantitative finance.

In the ideal course (a few years from now), a section of the class that focuses on a specific statistic, a specific model, or a specific type of problem will have links to several papers that cover essentially the same concepts but in different disciplines. Once I have a computing-related primary example of a topic, I can flesh it out with related examples from other fields.

Please send suggestions either as comments on this blog or by email to me. A typical suggestion would include

  • a link to a paper, a presentation, a blog post, or some other source that I can reach and read
  • comments from you identifying what part of the material I should teach
  • comments from you explaining why this is a noteworthy example
  • any other notes on how it might be taught, why it is important, etc.

Thanks for your help!


Credentialing in Software Testing: Elaborating on my STPCon Keynote

May 9th, 2013

A couple of weeks ago, I talked about the state of software testing education (and software testing certification) in the keynote panel at STPCon. My comments on high-volume test automation and qualitative methods were more widely noticed, but I think the educational issues are more significant.

Here is a summary:

  1. The North American educational systems are in a state of transition.
  2. We might see a decoupling of formal instruction from credentialing.
  3. We are likely to see a dispersion of credentialing—more organizations will issue more diverse credentials.
  4. Industrial credentials are likely to play a more significant role in the American economy (and probably have an increased or continued-high influence in many other places).

If these four predictions are accurate, then we have thinking to do about the kinds of credentialing available to software testers.

Transition

For much of the American population, the traditional university model is financially unsustainable. We are on the verge of a national credit crisis because of the immensity of student loan debt.

As a society, we are experimenting with a diverse set of instructional systems, including:

  • MOOCs (massive open online courses)
  • Traditionally-structured online courses with an enormous diversity of standards
  • Low-cost face-to-face courses (e.g. community colleges)
  • Industrial courses that are accepted for university credit
  • Traditional face-to-face courses

Across these, we see the full range from easy to hard, from no engagement with the instructor to intense personal engagement, from little student activity and little meaningful feedback to lots of both. There is huge diversity of standards between course structures and institutions and significant diversity within institutions.

  • Many courses are essentially self-study. Students learn from a book or a lecturer but they get no significant assignments, feedback or assessments. Many people can learn some topics this way. Some people can learn many topics this way. For most people, this isn’t a complete solution, but it could be a partial one.
  • Some of my students prosper most when I give them free rein, friendly feedback and low risk. In an environment that is supportive, provides personalized feedback by a human, but is not demanding, some students will take advantage of the flexibility by doing nothing, some students will get lost, and some students will do their best work.
  • The students who don’t do well in a low-demand situation often do better in a higher-demand course, and in my experience, many students need both—flexibility in fields that capture their imagination and structure/demand in fields that are less engrossing or that lie a little farther beyond the student’s current knowledge/ability than she can comfortably stretch to.

There is increasing (enormous) political pressure to allow students to take really-inexpensive MOOCs and get course credit for these at more expensive universities. More generally, there is increasing pressure to allow students to transfer courses across institutions. Most universities allow students to transfer in a few courses, but they impose limits in order to ensure that they transfer their culture to their students and to protect their standards. However, I suspect strongly that the traditional limits are about to collapse. The traditional model is financially unsustainable and so, somewhere, somehow, it has to crack. We will see a few reputable universities pressured (or legislated) into accepting many more credits. Once a few do it, others will follow.

In a situation like this, schools will have to find some other way to preserve their standards—their reputations, and thus the value of their degree for their graduates.

Seems likely to me that some schools will start offering degrees based on students’ performance on exit exams.

  • A high-standards institution might give a long and complex set of exams. Imagine paying $15,000 to take the exam series (and get grades and feedback) and another $15,000 if you pass, to get the degree.
  • At the other extreme, an institution might offer a suite of multiple-guess exams that can be machine-graded at a much lower cost.

The credibility of the degree would depend on the reputation of the exam (determined by “standards” combined with a bunch of marketing).

Once this system got working, we might see students take a series of courses (from a diverse collection of providers) and then take several degrees.

Maybe things won’t happen this way. But the traditional system is financially unsustainable. Something will have to change, and not just a little.

Decoupling Instruction from Credentialing

The vision above reflects a complete decoupling of instruction from credentialing. It might not be this extreme, but any level of decoupling creates new credentialing pressures / opportunities in industrial settings.

Instruction

Instruction consists of the courses, the coaching, the internships, and any other activities the students engage in to learn.

Credentialing

Credentials are independently-verifiable evidence that a person has some attribute, such as a skill, a type of knowledge, or a privilege.

There are several types of credentials:

  • A certification attests to some level of competency or privilege. For example,
    • A license to practice law, or to do plumbing, is a certification.
    • An organization might certify a person as competent to repair their equipment.
    • An organization might certify that, in their opinion, a person is competent to practice a profession.
  • A certificate attests that someone completed an activity
    • A certificate of completion of a course is a certificate
    • A university degree is a certificate
  • There are also formal recognitions (I’m sure there’s a better name for this…)
    • Awards from professional societies are recognitions
    • Granting someone an advanced type of membership (Senior Member or Fellow) in a professional society is a recognition
    • Election to some organizations (such as the American Law Institute or the Royal Academy of Science) is a recognition
    • I think I would class medals in this group
  • There are peer recognitions
    • Think of the nice things people say about you on LinkedIn or Entaggle
  • There are workproducts or results of work that are seen as honors
    • You have published X many publications
    • You worked on the development team for X

The primary credentials issued by universities are certificates (degrees). Sometimes, those are also certifications.

Dispersion of Credentialing

Anyone can issue a credential. However, the prestige, credibility, and power of credentials vary enormously.

  • If you need a specific credential to practice a profession, then no matter who endorses some other credential, or how nicely named that other credential is, it still won’t entitle you to practice that profession.
  • Advertising that you have a specific credential might make you seem more prestigious to some people and less prestigious to other people.

It is already the case that university degrees vary enormously in meaning and prestige. As schools further decouple instruction from degrees, I suspect that this variation will be taken even more seriously. Students of mine from Asia, and some consultants, tell me this is already the case in some Asian countries. Because of the enormous variation in quality among universities, and the large number of universities, a professional certificate or certification is often taken more seriously than a degree from a university that an employer does not know and respect.

Industrial Credentials

How does this relate to software testing? Well, if my analysis is correct (and it might well not be), then we’ll see an increase in the importance and value of credentialing by private organizations (companies, rather than universities).

I don’t believe that we’ll see a universally-accepted credential for software testers. The field is too diverse and the divisions in the field are too deep.

I hope we’ll see several credentialing systems that operate in parallel, reflecting different visions of what people should know, what they should believe, what they should be able to do, what agreements they are willing to make (and be bound by) in terms of professional ethics, and what methods of assessing these things are appropriate and in what depth.

Rather than seeing these as mutually-exclusive competing standards, I imagine that some people will choose to obtain several credentials.

A Few Comments On Our Current State

Software Testing has several types of credentials today. Here are notes on a few. I am intentionally skipping several that feel (to me) redundant with these or about which I have nothing useful to say. My goal is to trigger thoughts, not survey the field.

ISTQB

ISTQB is currently the leading provider of testing certifications in the world. ISTQB is the front end of a community that creates and sells courseware, courses, exams and credentials that align with their vision of the software testing field and the role of education within it. I am not personally fond of the Body of Knowledge that ISTQB bases its exams on. Nor am I fond of their approach to examinations (standardized tests that, to my eyes, emphasize memorization over comprehension and skill). I think they should call their credentials certificates rather than certifications. And my opinion of their marketing efforts is that they are probably not legally actionable, but I think they are misleading. (Apart from those minor flaws, I think ISTQB’s leadership includes many nice people.)

It seems to me that the right way to deal with ISTQB is to treat them as a participant in a marketplace. They sell what they sell. The best way to beat it is to sell something better. Some people are surprised to hear me say that because I have published plenty of criticisms of ISTQB. I think there is lots to criticize. But at some point, adding more criticism is just waste. Or worse, distraction. People are buying ISTQB credentials because they perceive a need. Their perception is often legitimate. If ISTQB is the best credential available to fill their need, they’ll buy it. So, to ISTQB’s critics, I offer this suggestion.

Industrial credentialing will probably get more important, not less important, over the next 20 years. Rather than wasting everyone’s time whining about the shortcomings of current credentials, do the work needed to create a viable alternative.

Before ending my comments on ISTQB, let me note some personal history.

Before ASTQB (American ISTQB) formed, a group of senior people in the community invited me into a series of meetings focused on creating a training-and-credentialing business in the United States. This was a private meeting, so I’m not going to say who sponsored it. The discussion revolved around a goal of providing one or more certification-like credentials for software testers that would be (this is my summary-list, not theirs, but I think it reflects their goals):

  • reasonably attainable (people could afford to get the credential, and reasonably smart people who worked hard could earn it),
  • credible (intellectually and professionally supported by senior people in the field who have earned good reputations),
  • scalable (it is feasible to build an infrastructure to provide the relevant training and assessment to many people), and
  • commercially viable (sufficient income to support instructors, maintainers of the courseware and associated documentation, assessors (such as graders of the students and evaluators of the courses), some level of marketing (because a credential that no one knows about isn’t worth much), and in the case of this group, money left over for profit. Note that many dimensions of “commercial viability” come into play even if there is absolutely no profit motive—the effort has to support itself, somehow).

I think these are reasonable requirements for a strong credential of this kind.

By this point, ISEB (the precursor to ISTQB) had achieved significant commercial success and gained wide acceptance. It was on people’s minds, but the committee gave me plenty of time to speak:

  • I talked about multiple-choice exams and why I didn’t like them.
  • I talked about the desirability of skill-based exams like Cisco’s, and the challenges of creating courses to support preparation for those types of exams.
  • I talked about some of the thinking that some of us had done on how to create a skill-based cert for testers, especially back when we were writing Lessons Learned.

But there was a problem in this. My pals and I had lots of scattered ideas about how to create the kind of certification system that we would like, but we had never figured out how to make it practical. The ideas that I thought were really good were unscalable or too expensive. And we knew it. If you ask today why there is no certification for context-driven testing, you might hear a lot of reasons, including principled-sounding attacks on the whole notion of certification. But back then, the only reason we didn’t have a context-driven certification was that we had no idea how to create one that we could believe in.

So, what I could not provide to the committee was a reasonably attainable, credible, scalable, commercially viable system—or a plan to create one.

The committee, quite reasonably, chose to seek a practical path toward a credential that they could actually create. I left the committee. I was not party to their later discussions, but I was not surprised that ASTQB formed and some of these folks chose to work with it. I have never forgotten that they gave me every chance to propose an alternative and I did not have a practical alternative to propose.

(Not long after that, I started an alternative project, Open Certification, to see if we could implement some of my ideas. We did a lot of work in that project, but it failed. The ideas really weren’t practical. We learned a lot, which in turn helped me create great courseware—BBST—and other ideas about certification that I might talk about more in the future. But the point that I am trying to emphasize here is that the people who founded ASTQB were open to better ideas, but they didn’t get them. I don’t see a reason to be outraged at them for that.)

The Old Boys’ Club

To some degree, your advancement in a profession is not based on what you know. It’s based on who you know and how much they like you.

We have several systems that record who likes you, including commercial ones (LinkedIn), noncommercial ones (Entaggle), and various types of marketing structures created by individuals or businesses.

There are advantages and disadvantages to systems based on whether the “right” people like you. Networking will never go away, and never should, but it seems to me that

Credentials based on what you know, what you can do, or what you have actually done are a lot more egalitarian than those based on who says they respect you.

I value personal references and referrals, but I think that reliance on these as our main credentialing system is a sure path to cronyism and an enemy of independent thinking.

My impression is that some people in the community have become big fans of reputation-systems as the field’s primary source of credentials. In at least some of the specific cases, I think the individuals would have liked the system a whole lot less when they were less influential.

Miagi-do

I’ve been delighted to see that the Miagi-do school has finally come public.

Michael Larsen states a key view succinctly:

I distrust any certification or course of study that doesn’t, in some way, actually have a tester demonstrate their skills, or have a chance to defend their reasoning or rationale behind those skills.

In terms of the four criteria that I mentioned above, I think this approach is probably reasonably attainable, and to me, it is definitely credible. Whether it is scalable and commercially viable has yet to be seen.

I think this is a clear and important alternative to ISTQB-style credentialing. I hope it is successful.

Other Ideas on the Horizon

There are other ideas on the horizon. I’m aware of a few of them and there are undoubtedly many others.

It is easy to criticize any specific credentialing system. All of them, now known or coming soon, have flaws.

What I am suggesting here is:

  • Industrial credentialing is likely to get more important whether you like it or not.
  • If you don’t like the current options, complaining won’t do much good. If you want to improve things, create something better.


This post is partially based on work supported by NSF research grant CCLI-0717613, “Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing.” Any opinions, findings and conclusions or recommendations expressed in this post are those of the author and do not necessarily reflect the views of the National Science Foundation.

DOMA–Time for a change

March 28th, 2013

[Image: standformarriage graphic]

Some people have asked me why I posted this graphic. It is a token of solidarity with the Human Rights Campaign. This is one of several images that HRC supporters posted on their websites while the United States Supreme Court was hearing arguments about the Defense of Marriage Act.

Presentation on software metrics

March 26th, 2013

Most software metrics are terrible. They are as overhyped as they are poorly researched.

But I think it’s part of the story of humanity that we’ve always worked with imperfect tools and always will. We succeed by learning the strengths, weaknesses and risks of our tools, improving them when we can, and mitigating their risks.

So how do we deal with this in a way that manages risk while getting useful information to people we care about?

I don’t think there are easy answers to this, but I think several people in the field are grappling with this in a constructive way. I’ve seen several intriguing conference-talk descriptions in the last few months and hope to post comments that praise some of them later. But for now, here’s my latest set of notes on this: http://kaner.com/pdfs/PracticalApproachToSoftwareMetrics.pdf


Follow-Up Testing in the BBST Bug Advocacy Course

February 10th, 2013

Becky and I are working on a new version of the BBST courses (BBST-2014). In the interim, to support the universities teaching the course, the AST, and the many people who are studying the materials on their own, we’re publishing some of the ideas we have for clarifying the course. The first two in this series focused on oracles in the Foundations course and on interactive grading as a way to give students more detailed feedback. This post is about follow-up testing, which we cover in the current BBST-BA’s slides 44-68.

Some failures are significant and they occur under circumstances that seem fairly commonplace. These will either be fixed, or there will be a compelling reason to not fix them.

Other failures need work. When you see them, there’s ambiguity about the significance of the underlying bug:

  • Some failures look minor. Underlying the failure is a bug that might (or might not) be more serious than you’d imagine from the first failure.
  • Some failures look rare or isolated. Underlying that failure is a bug that might (or might not) cause failures more often, or under a wider range of conditions, than you’d imagine from the first failure.

Investigating whether a minor-looking failure reflects a more serious bug

To find out whether the underlying bugs are more serious than the first failures make them look, we can do follow-up testing. But what tests?

That’s a question that Jack Falk, Hung Quoc Nguyen and I wrestled with many times. You can see a list of suggestions in Testing Computer Software. Slides 44-56 sort those ideas (and add a few) into four categories:

  1. Vary your behavior
  2. Vary the options and settings of the program
  3. Vary data that you load into the program
  4. Vary the software and hardware environment

Some students get confused by the categories (which is not shocking, because the categories are a somewhat idiosyncratic attempt to organize a broad collection of test ideas), confused enough that they don’t do a good job on the exam of generating test ideas from the categories.

For example, they have trouble with this question:

Suppose that you find a reproducible failure that doesn’t look very serious.

  • Describe the four tactics presented in the lecture for testing whether the defect is more serious than it first appeared.
  • As a particular example, suppose that the display got a little corrupted (stray dots on the screen, an unexpected font change, that kind of stuff) in Impress when you drag the scroll bar up and down. Describe four follow-up tests that you would run, one for each of the tactics that you listed above.

I’m not going to solve this puzzle for you, but the solution should be straightforward if you understand the categories.

The slides work through a slightly different example:

A program unexpectedly but slightly scrolls the display when you add two numbers:

  • The task is entering numbers and adding
  • The failure is the scrolling.

Let’s consider the categories in terms of this example.

1. Vary your behavior

When you run a test, you intentionally do some things as part of the test. For example, you might:

  • enter some data into input variables
  • write some data into data files that the program will read
  • give the program commands

You might change any of these as part of your follow-up testing. (What happens if I do this instead of that?)

These follow-up tests might include changing the data or adding steps, substituting steps, or taking steps away.

For example, if adding one pair of numbers causes unexpected scrolling, suppose you try adding two numbers many times. Will the program scroll more, or scroll differently, as you repeat the test?

Suppose we modified the example so the program reads (and then adds) two numbers from a data file. Changing that data file would be another example of varying your behavior.

The slides give several additional examples.

2. Vary the options and settings of the program

Think about Microsoft Word. Here are some examples of its options:

  • Show (or don’t show) formatting marks on the screen
  • Check spelling (or don’t check it) as you type

In addition, you might change a template that controls the formatting of new documents. Some examples of variables you might change in the template are:

  • the default typeface
  • the spacing between tab stops
  • the color of the text

Which of these are “options” and which of these are “settings”? It doesn’t matter. The terminology will change from program to program. What doesn’t change is that these are persistent variables. Their value stays with the program from one document to another.

3. Vary data that you load into the program

This isn’t well worded. Students confuse what I’m talking about with the basic test data.

Imagine again testing Microsoft Word. Suppose that you are focused on the format of the document, so your test documents have lots of headings and tables and columns and headers and footers (etc.). If you change the document, save it, and load the revised one, that is data that you load into the program, but I think of that type of change as part of 1. Vary your behavior.

When Word starts up, it also loads some files that might not be intentional parts of your test. For example, it loads a dictionary, a help file, a template, and probably other stuff. Often, we don’t even think of these files (and how what they hold might affect memory or performance) when we design tests. Sometimes, changing one of these files can reveal interesting information.

4. Vary the software and hardware environment

For example,

  • Change the operating system’s settings, such as the language settings
  • Change the hardware (a different video card) or how the operating system works with the hardware (a different driver)
  • Change hardware settings (same video card, different display resolution)

This is a type of configuration testing, but the goal here is to try to demonstrate a more impressive failure, not to assess the range of circumstances under which the bug will appear.

Investigating whether an obscure-looking failure will arise under more circumstances

In this case, the failure shows up under what appear to be special circumstances. Does it only show up under special circumstances?

Slides 58-68 discuss this, but some additional points are made on other slides or in the taped lecture. Bringing them together…

  1. Uncorner your corner cases
  2. Look for configuration dependence
  3. Check whether the bug is new to this version
  4. Check whether failures like this one already appear in the bug database
  5. Check whether bugs of this kind appear in other programs

Here are a few notes:

1. Uncorner your corner cases

Risk-based tests often use extreme values (boundary conditions). These are good for exposing a failure, but once you find the failure, try less extreme values. Demonstration of failure with a less extreme test will yield a more credible report.

2. Look for configuration dependence

In this case, the question is whether the failure will appear on many configurations or just this one. Try it with more memory, with another version of operating system (or on another OS altogether), etc.

3. Check whether the bug is new to this version

Does this bug appear in earlier versions of the program? If so, did users (or others) complain about it?

If the bug is new, especially if it was a side-effect of a fix to another bug, some people will take it much more seriously than a bug that has been around a long time but rarely complained about.

4. Check whether bugs like this one already appear in the bug database

One of the obvious ways to investigate whether a failure appears under more general circumstances than the ones in a specific test is to check the bug tracking database to see whether the failure has in fact occurred under other circumstances. It often takes some creativity and patience to search out reports of related failures (because they probably aren’t reported in exactly the same way as you’re thinking of the current failure), but if your database has lots of not-yet-fixed bugs, such a search is often worthwhile.

5. Check whether bugs of this kind appear in other programs

The failure you’re seeing might be caused by the code specifically written by this programmer, or the bug might be in a library used by the programmer. If it’s in the library, the same bug will be in other programs. Similarly, some programmers model (or copy) their code from textbook descriptions or from code they find on the web. If that code is incorrect, the error will probably show up in other programs.

If the error does appear in other programs, you might find discussions of the error (how to find it, when it appears, how serious it is) in discussions of those programs.

This post is partially based on work supported by NSF research grant CCLI-0717613, “Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing.” Any opinions, findings and conclusions or recommendations expressed in this post are those of the author and do not necessarily reflect the views of the National Science Foundation.

An Overview of High Volume Automated Testing

January 14th, 2013

Summary: This overview describes the general concept of high volume automated testing (HiVAT) and twelve examples of HiVAT techniques. The common thread of these techniques is automated generation, execution and evaluation of arbitrarily many tests. The individual tests are often weak, but taken together, they can expose problems that individually-crafted tests will miss. The simplest techniques offer high coverage: they can reach a program’s weakness in handling specific special cases. Other techniques take the program through long sequences of tests and can expose problems that build gradually (e.g. memory leaks, stack corruption or memory corruption) or that involve unexpected timing of events. In my experience, these sequences have exposed serious problems in embedded software, especially in systems that involved interaction among a few processors. As we embed more software into more consumer systems, including systems that pose life-critical risks, I believe these tests will become increasingly important.

I am writing this note as an intentionally rough draft. It serves as an introduction for a course on HiVAT at Florida Institute of Technology. It provides a structure for work in the course. It is intentionally under-referenced. One of the students’ tasks in the course is to dig through a highly disjointed literature and map research papers, practitioner papers, and conference presentations to the techniques listed here. Students might also add HiVAT techniques, with associated papers, or add sections on directly-relevant automation-support technology or directly-relevant surveys of test automation strategies / results. I will replace this post with later drafts that are more academically complete as we make progress in the course.

Consider automated regression testing. We reuse a regression test several times–perhaps running it on every build. But are these really automated? The computer might execute the tests and do a simple evaluation of the results, but a human probably designed that test, a human probably wrote the test code that the computer executes, a human probably provided the test input data by coding it directly into the test or by specifying parameters in an input file, a human probably provided the expected results that the program uses to evaluate the test result, and if it appears that there might be a problem, it will be a human who inspects the results, does the troubleshooting and either writes a bug report (if the program is broken) or rewrites the test. All that work by humans is manual.

Notice, by the way, that this interplay of human work and computer work applies whether the tests are run at the system level using a tool like QuickTest Pro, at the unit level using a tool like jUnit, or at a hybrid level, using a tool like FIT.

So, automated regression tests are actually manual tests with an automated tint.

And every manual software test is actually automated. When you run a manual test, you might type in the inputs and look at the outputs, but everything that happens from the acceptance of those inputs to the display of the results of processing them is done by a computer under the control of a program–that’s automated.

The difference between “manual” and “automated” is a matter of degree, not a matter of principle. Some tests are more automated, others are less automated, but there is no “all” and there is no “none” for test automation.

High-Volume Tests Focused on Inputs

High-Volume Parametric Variation

We can make the automated regression tests more automated by transforming some of the human tasks to computer tasks. For example, imagine testing a method that takes two inputs (FIRST, SECOND) and returns their sum (SUM). A typical regression test would include specific values for FIRST, SECOND, and EXPECTED_SUM. But suppose we replace the specific test values with a test data generator that supplies random values for FIRST and SECOND and calculates the value of EXPECTED_SUM. We can generate billions of tests this way, each a little different, with almost no increase in human effort.
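A minimal sketch of that transformation (the function under test and its name are stand-ins, not code from any real project):

```python
# Sketch of high-volume parametric variation. add_numbers() is a stand-in for
# the implementation under test; in real use the expected value would come from
# an independent computation or reference implementation rather than from the
# code being tested.
import random

def add_numbers(first, second):
    """Stand-in for the implementation under test."""
    return first + second

def run_high_volume_addition_tests(n_tests, seed=42):
    random.seed(seed)
    failures = []
    for _ in range(n_tests):
        first = random.randint(-10**9, 10**9)
        second = random.randint(-10**9, 10**9)
        expected_sum = first + second            # the generator's oracle
        actual = add_numbers(first, second)      # run the test
        if actual != expected_sum:
            failures.append((first, second, expected_sum, actual))
    return failures

if __name__ == "__main__":
    failures = run_high_volume_addition_tests(1_000_000)
    print(f"{len(failures)} failures out of 1,000,000 generated tests")
```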

This is one of the simplest examples of high volume automated testing. A human supplied the algorithm, but the test tool applies that algorithm to create, run, and evaluate the results of arbitrarily many tests.

In this example, those tests are all pretty similar. Why bother running them all? The traditional answer is “don’t bother.” Domain testing is the most widely used software testing technique. The point of domain testing is to help the tester minimize test redundancy by selecting a small number of inputs to represent the larger set of possibilities. Usually, this works well. Occasionally, a program has a problem with a small number of specific inputs. For example, the program might be optimized in a way that processes a few specific values specially. Or it might be vulnerable to small calculation errors that are usually too small to notice, but occasionally have a visible effect (see Doug Hoffman’s report at http://www.testingeducation.org/BBST/foundations/Hoffman_Exhaust_Options.pdf for an example).

High-Volume Combination Testing

Rather than thinking of testing a single function that takes a few inputs, imagine something more complex. A program processes several inputs, perhaps with a series of functions, and reports an output. Again, the traditional approach is to minimize redundancy. When we do combination tests (test several variables together), the set of possible tests is usually huge. If the variables are independent, we can minimize the number of combination tests with combinatorial testing (e.g. all-pairs or all-triples). If the variables are related, we can use cause-effect graphing instead. But if we suspect that the program will fail only on a small number of specific combinations (and we don’t know which few those are), we have to test a larger set of combinations, generating input values and calculating the expected result for each combination.

There are different strategies for generating the input values:

  • Exhaustive sampling. This tests all the combinations but the set might be impossibly large.
  • Random sampling. Generate random values for the inputs, stopping when some large number have been tested.
  • Optimized sampling. Use an algorithm that optimizes the set of combinations in some way. As a very simple example, if you are going to test a million combinations, you could divide the space of combinations into a million same-size, non-overlapping subsets and sample one value from each. Or you could use a sequential algorithm that assigns values for the next combination by creating a test that is distant (according to some distance function) from all previous tests. (A sketch of the random and distance-based approaches follows this list.)
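
As a sketch of the second and third strategies, assuming three integer inputs in the range 0–999 (an arbitrary choice for illustration), the code below generates combinations by pure random sampling and by a simple distance-maximizing sequential rule:

    import random

    def random_sample(n_tests, rng):
        """Random sampling: n_tests independent random combinations."""
        return [(rng.randrange(1000), rng.randrange(1000), rng.randrange(1000))
                for _ in range(n_tests)]

    def distant_sample(n_tests, rng, candidates_per_step=50):
        """Optimized sampling: each new combination is the candidate whose
        nearest already-chosen combination is farthest away (Euclidean distance)."""
        def new_point():
            return (rng.randrange(1000), rng.randrange(1000), rng.randrange(1000))
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
        chosen = [new_point()]
        while len(chosen) < n_tests:
            candidates = [new_point() for _ in range(candidates_per_step)]
            best = max(candidates, key=lambda c: min(dist(c, p) for p in chosen))
            chosen.append(best)
        return chosen

    rng = random.Random(0)
    print(random_sample(3, rng))
    print(distant_sample(3, rng))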

Input Fuzzing

So far, I’ve presented tests that generate both the inputs and the expected value of the test. The expected value serves as an oracle, a mechanism for deciding whether the program passed or failed the test. Test oracles are incomplete, but they are useful for automation. Even if we can’t detect all possible failures, we can detect any failures of a certain kind (such as a calculation error).

“Fuzzing” refers to a family of high-volume automated tests that vary the inputs but have no oracle. The test runs until the program crashes or fails in some other unmissable way.
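
A minimal sketch of the idea, assuming a hypothetical parse() function stands in for the program under test: random byte strings go in, and the only check is for an unexpected exception (the library-level analogue of a crash).

    import random

    def parse(data: bytes):
        # hypothetical stand-in for the input-handling code under test
        return data.decode("utf-8")

    def fuzz(n_runs=10_000, max_len=256, seed=1):
        rng = random.Random(seed)
        for i in range(n_runs):
            blob = bytes(rng.randrange(256) for _ in range(rng.randrange(1, max_len)))
            try:
                parse(blob)
            except UnicodeDecodeError:
                pass        # an expected rejection of bad input is not a failure
            except Exception as exc:
                print(f"run {i}: unexpected {type(exc).__name__} on input {blob!r}")

    fuzz()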

Hostile Data Stream Testing

Alan Jorgensen tested for many types of security errors by taking a good file in a standard format (e.g. PDF) and corrupting it by substituting one string in the file with another. In Jorgensen’s work, the new string was typically much longer or was syntactically different. Jorgensen would then open the now-corrupt file with the application under test. The program might reject the file as corrupt or accept it. If it accepted the file, Jorgensen could ask the program to use the file in some way (e.g. display it) and look for misbehavior. In some cases, Jorgensen could exploit a failure to recognize a corrupt file by embedding executable code that the program would then inappropriately execute. Jorgensen’s search for corruptions that would escape detection was an intensely automated activity. His analysis of exploitability was not.

There are several published variations of file fuzzing (variation of the contents or format of an input file) that are suited for different kinds of risks.
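
The sketch below shows the corruption step in the style Jorgensen used, under the assumption that some driver (open_in_app(), hypothetical here) can feed the corrupted bytes to the application under test and report whether it rejected the file, accepted it, or crashed.

    import random

    def corrupt(original: bytes, rng: random.Random) -> bytes:
        """Replace one randomly chosen short substring with a much longer one."""
        start = rng.randrange(max(1, len(original) - 8))
        length = rng.randrange(1, 8)
        hostile = b"A" * rng.randrange(100, 10_000)   # overlong replacement string
        return original[:start] + hostile + original[start + length:]

    def hostile_stream_test(path, n_variants, open_in_app, seed=7):
        rng = random.Random(seed)
        with open(path, "rb") as f:
            original = f.read()
        for i in range(n_variants):
            variant = corrupt(original, rng)
            outcome = open_in_app(variant)   # e.g. "rejected", "accepted", "crashed"
            if outcome != "rejected":
                print(f"variant {i}: file was {outcome} -- worth investigating")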

High-Volume Tests that Exploit the Availability of an Oracle

Having an oracle gives you a strong basis for high-volume automated testing. All that’s required is to generate inputs to the program under test and to drive the program to process the inputs. Given a result, you can use the oracle to check whether the program passed the test. No oracle is perfect: the program might misbehave in ways the oracle cannot detect, even when its output matches the oracle’s. But even though you can’t learn everything that might possibly be interesting by relying on an oracle, you have a basis for running a boatload of tests that collectively check thoroughly for some types of failures and give you an opportunity to stumble onto some other types of failures (e.g. crashes from memory leaks) that you didn’t design the tests to look for. For any other ways that you can imagine the program might fail, you can design tailored manual tests as needed.

High-volume testing with an oracle won’t completely test the program (against all possible risks), but it will provide you with a level of coverage that you can’t achieve by hand.

Function Equivalence Testing

Function equivalence testing starts with an oracle–a reference function that should work the same way as the function under test. Given the reference function, you can feed inputs to the function under test, get the results, and check whether the reference function gives the same results. You can test with as many input values as you want, perhaps generating a large random set of inputs or (if you have enough available computer time) testing every possible input.
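
Here is a minimal sketch of the pattern, using Python’s math.sqrt as the reference function and a hypothetical sqrt_under_test() as the implementation being checked, with a small tolerance for rounding differences:

    import math
    import random

    def sqrt_under_test(x):
        # hypothetical stand-in for the implementation being tested
        return x ** 0.5

    def equivalence_test(n_tests=100_000, tolerance=1e-12, seed=3):
        rng = random.Random(seed)
        for _ in range(n_tests):
            x = rng.uniform(0, 1e12)
            expected = math.sqrt(x)          # the reference function is the oracle
            actual = sqrt_under_test(x)
            if not math.isclose(actual, expected, rel_tol=tolerance):
                print(f"Mismatch for {x}: {actual} vs {expected}")
                return False
        return True

    equivalence_test()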

The final exam in my Programmer-Testing course illustrates function equivalence testing. The question that students have to answer is whether Open Office Calc does its calculations the same way as Microsoft Excel. We adopt a quality criterion: If Calc does it the same way as Excel, that’s good enough, even if Excel makes a few calculation errors.

To answer the question, the students:

  • test several functions individually
  • then test those functions together by testing formulas that combine several functions

To do this,

  • The students pick several individual functions in Calc
  • They test each by feeding random inputs to the Calc function and the same inputs to the corresponding Excel function.
  • Then they create random formulas that combine the functions, feed random data to the functions in the formula, and compare results.

If you test enough inputs, and the Calc results are always the same as Excel’s (allowing a little rounding error), it is reasonable to conclude that the calculation in Calc is equivalent to the calculation in Excel.

Constraint Checks

We use a constraint oracle to check for impossible values or impossible relationships.

For example, an American ZIP code must be 5 or 9 digits. If you are working with a program that reads (or looks up or otherwise processes) ZIP codes, you can check every code that it processes. If it accepts (treats as a ZIP code) anything that has non-numeric characters or the wrong number of characters, then it has a bug. If you can find a way to drive the program so that it reads (or does whatever it does with) lots of ZIP codes, you have a basis for a set of high-volume automated tests.
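
A sketch of the constraint oracle for this example: harvesting the values the program actually processes is application-specific, but the check itself is tiny.

    import re

    ZIP_PATTERN = re.compile(r"^\d{5}(\d{4})?$")   # exactly 5 or exactly 9 digits

    def is_legal_zip(value: str) -> bool:
        return bool(ZIP_PATTERN.match(value))

    def audit_processed_zips(processed_values):
        """processed_values: strings the program treated as ZIP codes (harvested
        from logs, a test hook, or whatever access the program under test allows)."""
        violations = [v for v in processed_values if not is_legal_zip(v)]
        for v in violations:
            print(f"Constraint violation: program accepted {v!r} as a ZIP code")
        return violations

    audit_processed_zips(["32901", "329011234", "3290A", "3290"])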

Inverse Operations

Imagine taking a list that is sorted from high to low, sorting it low to high, sorting it back (high to low), and comparing the result with the original list. If you can give this program enough lists (enough sizes, enough diversity of values), you can eventually conclude that it sorts correctly (or not). For any operation that you can invert, you can build a high-volume test series.
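
A sketch of the round-trip check for the sorting example; sort_under_test() is a hypothetical stand-in for the routine being tested.

    import random

    def sort_under_test(values, descending=False):
        # hypothetical stand-in for the sort routine under test
        return sorted(values, reverse=descending)

    def inverse_sort_test(n_lists=10_000, max_len=500, seed=11):
        rng = random.Random(seed)
        for i in range(n_lists):
            original = sorted((rng.randint(-10**6, 10**6)
                               for _ in range(rng.randrange(max_len))), reverse=True)
            ascending = sort_under_test(original, descending=False)
            restored = sort_under_test(ascending, descending=True)
            if restored != original:
                print(f"list {i}: round-trip sort did not restore the original")
                return False
        return True

    inverse_sort_test()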

State-Model Based Testing (SMBT)

If you have a state model (including a way to decide where the program should go if you give it an input and a way to determine whether the program actually got there), you can feed the program an arbitrarily long sequence of inputs and check the results.

I think the most common way to do SMBT is with a deterministic series of tests, typically the shortest series that will achieve a specified level of coverage. The typical coverage goal is every transition from each possible state to every state that it can reach next. You can turn this into a high-volume series by selecting states and inputs randomly and running the sequence for an arbitrarily long time. Ben Simo cautions that this has to be monitored because some programs will get locked into relatively tight loops, never reaching some states or some transitions. If you write your test execution system to check for this, though, you can force it out of the loop and into a not-yet-hit state.
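
A sketch of the random-walk version, with a toy two-state model and a hypothetical program object that exposes apply() and current_state(); a real harness would also include the loop-escape monitoring that Ben Simo describes.

    import random

    MODEL = {                       # expected transitions: state -> {input: next_state}
        "idle":    {"start": "running", "stop": "idle"},
        "running": {"start": "running", "stop": "idle"},
    }

    def random_walk(program, n_steps=100_000, seed=5):
        rng = random.Random(seed)
        expected = "idle"
        covered = set()
        for step in range(n_steps):
            action = rng.choice(list(MODEL[expected]))
            covered.add((expected, action))
            expected = MODEL[expected][action]       # where the model says we should end up
            program.apply(action)
            if program.current_state() != expected:
                print(f"step {step}: expected {expected}, got {program.current_state()}")
                return False
        print(f"covered {len(covered)} (state, input) pairs")
        return True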

Diagnostics-Based Testing

I worked at a company that designed telephone systems (Telenova). Our phones gave customers a menu-driven interface to 108 voice features and 110 data features. Imagine running a system test with 100 phones calling each other, putting each other on hold, transferring calls from one phone to another, then conferencing in outside lines, etc. We were never able to create a full state model for our system tests. They were just too complex.

Instead, the Telenova staff (programmers, testers, and hardware engineers) designed a simulator that could drive the phones from state to state with specified or random inputs. They wrote probes into the code to check whether the system was behaving in unexpected ways. A probe is like an assert command, but if the program triggers the probe, it logs an error instead of halting the program. A probe might check whether a variable had an unexpected value, whether a set of variables had unexpected values relative to each other, whether the program went through a series of states in an unexpected order, etc.
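
The Telenova system was not written in Python, but a probe of this kind might look roughly like this sketch (the names are hypothetical):

    import logging

    log = logging.getLogger("diagnostics")

    def probe(condition: bool, message: str, **context):
        """Like assert, but log the violation and keep running instead of halting."""
        if not condition:
            log.error("PROBE: %s | %s", message, context)

    # example use inside the code under test (values are illustrative):
    # probe(0 <= line_state <= MAX_LINE_STATE, "line state out of range",
    #       line_state=line_state, phone_id=phone_id)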

Because these probes checked the internal state of the system, and might check any aspect of the system, we called them diagnostics.

Implementing this type of testing required a lot of collaboration. The programmers wrote probes into their code. Testers did the first evaluations of the test logs and did extensive troubleshooting, looking for simple replication conditions for events that showed up in the logs. Testers and programmers worked together to fix bugs, change the probes, and specify the next test series.

As challenging as the implementation was, this testing revealed a remarkable number of interesting problems, including problems that would have been very hard to find in traditional ways but had the potential to cause serious failures in the field. This is what convinced me of the value of high-volume automated testing.

High-Volume Tests that Exploit the Availability of Existing Tests or Tools

Sometimes, the best reason to adopt a high-volume automated technique is that the adoption will be relatively easy. That is, large parts of the job are already done or expensive tools that can be used to do the job are already in place.

Long-Sequence Regression Testing (LSRT)

For example, Pat McGee and I wrote about a well-known company that repurposed its huge collection of regression tests of its office-automation products’ firmware. (In deference to the company’s desire not to be named, we called it Mentsville.)

When you do LSRT, you start by running the regression tests against the current build. From that set, pick only tests that the program passes. Run these in random order until the program fails (or you’ve run them for a long enough time). The original regression tests were designed to reveal functional problems, but we got past those by using only tests that we knew the program could pass when they were run one at a time. The bugs we found came from running the tests in a long series. For example, the program might run a series of 1000 tests, thirty of them the same as the first (Test 1), but it might not fail Test 1 until that 30th run. Why did it fail the 30th time and not before?

  • Sometimes, the problem was a gradual build-up of bad data in the stack or memory.
  • Sometimes, the problem involved timing. For example, sometimes one processor would stay busy (from the last test) for an unexpectedly long time and wouldn’t be ready when it was expected for this test. Or sometimes the firmware would expect a location in memory to have been updated, but in this unusual sequence, the process or processor wouldn’t yet have completed the relevant calculation.
  • Sometimes the problem was insufficient memory (memory leak) and some of the leaks were subtle, requiring a specific sequence of events rather than a simple call to a single function.

These were similar to the kinds of problems we found at Telenova (running sequences of diagnostics-supported tests overnight).

Troubleshooting the failures was a challenge because it was hard to tell when the underlying failure actually occurred. Something could happen early in testing that wouldn’t cause an immediate failure but would cause gradual corruption of memory until, hours later, the system crashed. That early event was what we needed in order to replicate the failure. To make troubleshooting easier, we started running diagnostics between tests, checking the state of memory, or how long a task had taken to execute, or whether a processor was still busy, or any of over 1000 other available system checks. We only ran a few diagnostics between tests. (Each diagnostic changed the state of the system. We felt that running too many diagnostics would change the state too much for us to get an accurate picture of the effect of running several regression tests in a row.) But those few diagnostics could tell us a great deal. And if we needed more information, we could run the same 1000-test sequence again, but with different diagnostics between tests.
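
Putting the pieces together, a minimal LSRT driver might look like the sketch below. The regression tests, the test runner, and the diagnostics are hypothetical stand-ins for whatever already exists; only tests known to pass individually go into the pool, they run in random order until a failure or a time limit, and only a few diagnostics run between tests.

    import random
    import time

    def long_sequence_run(passing_tests, run_test, run_diagnostic, diagnostics,
                          hours=8.0, seed=17):
        rng = random.Random(seed)
        deadline = time.time() + hours * 3600
        executed = []
        while time.time() < deadline:
            test = rng.choice(passing_tests)          # random order; tests may repeat
            executed.append(test)
            if not run_test(test):
                print(f"{test} failed after {len(executed)} tests; sequence logged")
                return executed                       # the sequence is the starting point for troubleshooting
            for diag in rng.sample(diagnostics, k=min(3, len(diagnostics))):
                run_diagnostic(diag)                  # only a few checks, to limit state changes
        print(f"no failure in {len(executed)} tests")
        return executed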

High-Volume Protocol Testing

A protocol specifies the rules for communication between two programs or two systems. For example, an ecommerce site might interact with VISA to bill a customer’s VISA card for a purchase. The protocol specifies what commands (messages) the site can send to VISA, what their structure should be, what types of data should appear in the messages and where, and what responses are possible (and what they mean). A protocol test sends a command to the remote system and evaluates the results (or sends a related series of commands and evaluates the series of responses back).

With popular systems, like VISA, the developers of many programs want to test whether their software works with the system (whether they’ve implemented the protocol correctly and how the remote system actually responds). A tool to run these types of tests might be already available, or pieces of it might be available. If enough is available, it might be easy to extend it, to generate long random sequences of commands to the remote system and to process the system’s responses to each one.
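
A sketch of the high-volume version, assuming a hypothetical send_command() that talks to the remote system and a table of the responses the protocol allows for each command (the commands here are made up for illustration):

    import random

    LEGAL_RESPONSES = {            # hypothetical protocol: command -> allowed responses
        "AUTH":   {"OK", "DENIED"},
        "CHARGE": {"APPROVED", "DECLINED", "ERROR"},
        "REFUND": {"APPROVED", "ERROR"},
    }

    def protocol_random_run(send_command, n_commands=100_000, seed=23):
        rng = random.Random(seed)
        for i in range(n_commands):
            cmd = rng.choice(list(LEGAL_RESPONSES))
            response = send_command(cmd)
            if response not in LEGAL_RESPONSES[cmd]:
                print(f"command {i}: {cmd} got unexpected response {response!r}")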

Load-Enhanced Functional Testing

Common lore back in the 1970s was that a system behaved differently, functionally differently, when it was running under load. Tasks that it could perform correctly under “normal” load were done incorrectly when the system got busy.

Alberto Savoia described his experience of this in a presentation at the STAR conference in 2000. Segue (a test automation tool developer) built a service around it. As I understand their results, a system could appear 50%-busy (busy, but not saturated) but still be unable to correctly run some regression tests. The value of this type of testing is that it can expose functional weaknesses in the system. For example, suppose that a system running low on memory will rely more heavily on virtual memory and so the timing of its actions slows down. Suppose that a program running on the system spreads tasks across processors in a way that makes it vulnerable to race conditions. Suppose that Task 1 on Processor 1 will always finish before Task 2 on Processor 2 when the system is running under normal load, but under some heavy loads, Processor 1 will get tied up and won’t process work as quickly as Processor 2. If the program assumes that Task 1 will always get done before Task 2, that assumption will fail under load (and so might the program).

In the field, bugs like this produce hard-to-reproduce failures. Unless you find ways to play with the system’s timing, you might never replicate problems like this in the lab. They just become mystery-failures that a few customers call to complain about.

If you are already load-testing a program and if you already have a regression suite, then it might be very easy to turn this into a long-sequence regression test running in parallel with a moderate load test. The oracles include whatever comes with each regression test (to tell you whether the program passed that test or not) plus crashes, obvious long delays, or warnings from diagnostics (if you add those in).
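
A sketch of how the two existing pieces might be combined; apply_load(), the regression tests, and the test runner are all stand-ins for tools that are assumed to be in place already.

    import random
    import threading
    import time

    def load_generator(apply_load, stop_event):
        """Keep the system moderately busy until told to stop."""
        while not stop_event.is_set():
            apply_load()          # e.g. fire a batch of background requests
            time.sleep(0.5)       # pause between bursts; tune so the system stays ~50% busy

    def run_under_load(passing_tests, run_test, apply_load, hours=2.0, seed=29):
        stop = threading.Event()
        threading.Thread(target=load_generator, args=(apply_load, stop), daemon=True).start()
        rng = random.Random(seed)
        deadline = time.time() + hours * 3600
        try:
            while time.time() < deadline:
                test = rng.choice(passing_tests)
                if not run_test(test):            # the per-test oracle; crashes surface as exceptions
                    print(f"{test} failed under load")
                    return False
            return True
        finally:
            stop.set()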

A Few More Notes on History

I think that most of the high-volume work has been done in industry and not published in the traditional academic literature. For example, the best-known (best-publicized) family of high-volume techniques is “fuzzing”, attributed to Professor Barton Miller at the University of Wisconsin in 1988.

  • But I was seeing industrial demonstrations of this approach when I first moved to Silicon Valley in 1983, and I’ve been told of applications as early as 1966 (the “Evil” program at Hewlett-Packard).
  • Long-sequence regression testing was initially developed in 1984 or 1985. I played a minor role in developing Telenova’s diagnostics-based approach in 1987; my impression was that we were applying ideas already implemented years earlier at other telephone companies.
  • The testing staff at WordStar (a word-processing company) used a commercially available tool to feed long sequences of tests from one computer to another in 1984 or 1985. The first computer generated the tests and analyzed the test results. The second machine ran the program under test (WordStar). They were hooked up so that commands from the first machine looked like keyboard inputs to the second, and outputs that the second machine intended for its display actually went to the first machine for analysis. As an example of the types of bugs they found with this setup, they were able to replicate a seemingly-irreproducible crash that turned out to result from a memory leak. If you boldfaced a selection of text and then italicized it, there was a leak. If you italicized first, then applied bold, no leak. The test that exposed this involved a long random sequence of commands. I think this was a normal way to use that type of tool (long sequences with some level of randomization).
  • We also used random input generators in the telephone world. I think of a tool called The Hammer, but there were earlier tools of this class. Hammer Technologies was formed in 1991 and I was hearing about these types of tools back in the mid-to-late 1980s while I was at Telenova.

I’ve heard of related work at Texas Instruments, Microsoft, AT&T, Rolm, and other telephone companies, in the testing of FAA-regulated systems, at some auto makers, and at some other companies. I’m very confident that the work actually done is much more broadly spread than this small group of companies and includes many more techniques than I’ve listed here. However, most of what I’ve seen has come from semi-private demonstrations or descriptions, or from demonstrations and descriptions at practitioner conferences. Most of the interesting applications that I have personally heard of have involved firmware or other software that controls hardware systems.

As far as I can tell, there is no common vocabulary for these techniques. Much of what has been published has been lost because many practitioner conference proceedings are not widely available and because old papers describing techniques under nonstandard names are just not being found in searches. I hope that we’ll fill in some of these gaps over the next several months. But even for the techniques that we can’t find in publications, it’s important to recognize how advanced the state of the practice was 25 years ago. I think we are looking for a way to make sometimes-closely-held industrial practices more widely known and more widely adopted, rather than inventing a new area.

This post is partially based on work supported by NSF research grant CCLI-0717613, “Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing.” Any opinions, findings and conclusions or recommendations expressed in this post are those of the author and do not necessarily reflect the views of the National Science Foundation.

 

Interactive Grading in University and Practitioner Classes: An Experience Report

January 14th, 2013

Summary: Graders typically work in private with no opportunity to ask even the simplest questions about a student’s submitted work. Interactive Grading is a technique that requires the student to participate in the grading of their work. The aim of this post is to share my experiences with interactive grading, with some tips for others who want to try it. I start with an overview and then provide three detailed examples, with suggestions for the conduct of the session. Let me stress one point: care must be exercised to keep the student comfortable and engaged, and not let the session degenerate into another lecture. In terms of results, most students told me that interactive grading was a good use of their time and that it helped them improve their performance. Several asked me to add more of it to my courses. A few viewed it as more indoctrination. In terms of impact on student grades, the results showed marginal improvement.

Interactive grading is not a new idea. I think most instructors have done this for some students at some times. Certainly, I have. So when professor Keith Gallagher described it to me as an important part of his teaching, I didn’t initially understand its significance. I decided to try it myself after several of Keith’s students talked favorably about their classes with him. It wasn’t until I tried it that I realized what I had been missing.

That start came 15 months ago. Since then, I’ve used interactive grading in the online (professional-development) BBST classes, hybrid (online + face-to-face) university software testing classes, and a face-to-face university software metrics course. I’ve used it for exams, essays, and assignments (take-home complex tasks). Overall, it’s been a positive change.

These notes describe my personal experiences and reflections as an instructor. I emphasize this by writing in the very-obvious first person. Your experiences might be different.

What is Interactive Grading?

When I teach, I assign tasks and students submit their work to me for grading.

Usually I review student work in private and give them feedback after I have completed my review.

  • When I do interactive grading:
    • I meet with the student before I review the work.
    • I read the work for the first time while I meet with the student.
    • I ask the student questions, often open-ended questions that help me understand what the student was trying to say or achieve. Students often demonstrate that they understood the material better than their submitted work suggests. If they misunderstood part of the task, we can get to the bottom of the misunderstanding and they can try to demonstrate, during this meeting, their ability to do the task that was actually assigned.
    • I often coach the student, offering suggestions to improve the student’s strategy or demonstrating how to do parts of the task.
    • I typically show the student a grading rubric early in the meeting and assign the grade at the end of the meeting.
  • When I explicitly build interactive grading into my class:
    • It becomes part of the normal process, rather than an exception for a student who needs special attention. This changes the nature and tone of the discussions.
    • Every student knows well in advance what work will be interactively graded and that every student’s work will be handled the same way. This changes how they prepare for the meetings and how they interpret the meetings.
    • I can plan later parts of the course around the fact that the students have already had this experience. This changes my course design.

Costs and Benefits of Interactive Grading

Here is a summary of my conclusions. I’ll support this summary later in this report, with more detailed descriptions of what we did and what happened.

Costs

Interactive grading feels like it takes more time:

  • It takes time to prepare a grading structure that the students can understand (and therefore that I can use effectively when we have the meeting).
  • Scheduling can take a lot of time.
  • The meetings sometimes run long. It feels as though it takes longer to have the meeting than it would take to grade the work normally.

When I’ve checked my grading times for exams and assignments that I do in the traditional way, I think I actually spend the same amount of time (or more). I also do the same level of preparation. (Note: I do a lot of pre-grading preparation. The contrast might be greater for a less formal grader.)

As far as I can tell, the actual difference for me is not time, it is that interactive grading meetings are more stressful for me than grading in a quiet, comfortable home office. That makes it feel longer.

Benefits for the Students

During interactive grading, I can ask questions like,

  • What were you thinking?
  • What do you think this word in the question means? If I gave you a different explanation of what this word means, how would that affect your answer to the question?
  • Give me an example of what you are describing?
  • Can you give me a real-life example of what you are describing? For example, suppose we were working with OpenOffice. How would this come up in that project?
  • Can you explain this with a diagram? Show me on my whiteboard.
  • How would you answer this if I changed the question’s wording this way?
  • How would someone actually do that?
  • Why would anyone want to do that task that way? Isn’t there a simpler way to do the same thing?

I will raise the grade for a student who does a good job with these questions. I might say to the student,

“If I was only grading the written answer, you would get a ‘D’. But with your explanation, I am giving you a ‘B’. We need to talk about how you can present what you know better, so that you can get a ‘B’ again on the next exam, when I grade your written answers without an interactive supplement.”

If a student performs poorly on an exam (or an assignment, or an essay), the problem might be weak competence or weak performance.

  • A student who doesn’t know the material has no competence.
  • A student who knows the material but nevertheless gives a poor answer on an exam is showing poor performance. For example, you won’t get a good answer from a knowledgeable student who writes poorly or in a disorganized way or who misunderstands the question.

These focusing questions:

  • give the student who knows the material a chance to give a much better explanation or a much better defense of their answer.
  • give the student who knows the material but performs poorly some indicators of the types of performance improvements they need to work on.
  • give the student who doesn’t know the material a clear form of feedback on why they are getting a poor grade.

Let’s consider competence problems. Students might lack the knowledge or the skills they are supposed to be learning for several reasons:

  • Some students simply don’t take the time or make the effort to do good work. Interactive grading probably won’t do much for them, beyond helping some of them understand the standards better.
  • Some students memorize words that they don’t really understand. The interactive grading discussion helps (some of) them understand a bit better the differences between memorized words and understanding something well enough to explain it in their own words and to explain how to do it or use it or why it’s important. It gives them a path to a different type of answer when they ask themselves while studying, “Do I know this well enough?”
  • Some students lack basic student-skills (how to study, how to look things up online, how to use the library, etc.). I can demonstrate these activities during the discussion, having the student do them with me. Students don’t become experts overnight with these, but as I’ll discuss below, I think this sometimes leads to noticeable improvements.

Some of the tasks that I assign to students can be done in a professional way. I choose tasks that I can demonstrate at a professional level of skill. In the give-and-take of interactive grading, I can say, “Let me show you how a professional would do that.” There is a risk of hijacking the discussion, turning it into yet-another-lecture. But judiciously used, this is a very personalized type of coaching of complex skills.

Now consider performance problems. The student understands the material but provides an exam answer or submits an assigned paper that doesn’t adequately reflect what they know, at their level of sophistication. A student with an “A” level of knowledge might look like a “C” student. A student with a “C” level of knowledge might look like a “D” or “F” student. These students are often puzzled by their poor grades. The interactive grading format makes it possible for me to show a student example after example after example of things they are doing (incomprehensible sentences, uninterpretable diagrams, confusing structure, confusing formatting, etc.) and how these make it hard on the reader. Here are two examples:

  • Some of my students speak English as a second language. Some speak/write it well; some are trying hard to communicate well but make grammar/spelling errors that don’t interfere with my ability to understand their writing. Some write sentences that I cannot understand. They expect me to guess their meaning and they expect me to bias my guessing in their favor. During interactive grading, I can read a sentence, realize that I can’t understand it, and then ask the student what it means. In some cases, the student had no idea (they were, as far as I can tell, bluffing, hoping I would give them points for nothing). In other cases, the student intended a meaning but they realized during the discussion that they were not conveying that meaning. For some students, this is a surprise. They didn’t realize that they were failing to communicate. Some have said to me that they had previously thought they were being downgraded for minor errors in their writing (spelling, grammar) rather than for writing something that the instructor could not understand. For some students, I think this changes their motivation to improve their writing.
  • Some students write disorganized answers, or answers that are organized in a fundamentally different way from the structure explicitly requested by the question. Some students do this strategically (with the goal of disguising their ignorance). Others are simply communicating poorly. In my experience, retraining these students is very hard, but this gives me another opportunity to highlight the problems in the work, to demonstrate how the problems affect how I analyze and evaluate their work, and how they could do it differently. (For more on this, see the discussion in the next section, Benefits for Me).

Benefits for Me

It’s easy to grade “A” work. The student “gets it” and so they get a high grade. When I recognize quickly that the student has met my objectives for a specific exam question, I stop analyzing it, award a very high grade, and move to the next question. Very fast.

In contrast, a typical “C” or “D” answer takes a lot longer to grade. The answer is typically disorganized, confused, hard to understand, rewords the question in an effort to present the question as its answer, seems to make inappropriate assumptions about what I should find obvious, has some mistakes, and/or has contradictions or inconsistencies that make the answer incoherent even though each individual component could be argued to be not-necessarily-wrong.

When I say, “disorganized”, I mean (for example) that if the question asks for parts (1), (2), (3), (4) and (5), the student will give three sections instead that address (in the first one) (1), (3) and (5), (in the second one) (1) and (2) and (in the third one) (3) and (5) with a couple of words from the question about part (4) but no added information.

I waste a lot of time trying to understand these answers. When I grade privately, I read a bad answer over several times, muttering as I try to parse the sentences and map the content to the question that was asked. It’s uncertain work–I constantly question whether I am actually understanding what the student meant and trying to figure out how much benefit of how much doubt I should give the student.

When I do interactive grading with a student, I don’t have to guess. I can say, “The question asked for Part 1. I don’t see an answer directly to Part 1. Can you explain how your answer maps to this part of the question?” I insist that the student map their actual words on the exam to the question that was asked. If they can’t sort it out for me, they can flunk. Next time, based on this experience, they can write in a way that will make it easier to map the answer to the question (make it easier for me to grade; get them a better grade).

This discussion is difficult. It can be unpleasant for the student and for me, and I have to be diplomatic and somewhat encouraging or it will become a mess. But I don’t have to struggle alone with not understanding what was written. I can put the burden back on the student.

Some other benefits:

  • Students who write essays by copying content without understanding it demonstrate their cluelessness when we grade the essay interactively.
  • Students who cheat on a take-home exam (in my experience so far) avoid the interactive grading session, where they would probably demonstrate that they don’t understand their own answers.
  • Students who want to haggle with me about their grade have an opportunity to do so without driving me crazy. I now have a way to say, “Show me what makes this a ‘B’,” instead of arguing with them about individual points or about their situational need for a higher grade.

Maybe the most important benefits:

  • (Most) students tell me they like it and many of them ask to do it again (to transform a subsequent task into an interactively graded one or to add interactive grading in another course).
  • Their performance seems to improve, sometimes significantly, which makes the next work easier to grade
  • This is highly personalized instruction. The student is getting one-on-one attention from the professor. Many students feel as though they don’t get enough personal attention and this addresses that feeling. It also makes some students a little more comfortable with dropping by my office for advice at other times.

A More Detailed Report

Someone who is just trying to learn what interactive grading is should stop here. What follows is “nuts and bolts”.

I’ve done interactive grading for three types of work:

  • midterm exams
  • practical assignments (homework that requires the student to apply something they have learned to a real-life task)
  • research essays

The overall process for interactive grading is the same for all three. But I’ve also noticed some differences. Marking exams isn’t the same as marking essays; neither is grading them interactively.

If you’re going to try interactive grading for yourself, the descriptions of how it worked for each of these, and the differences among them, might help you make a faster and surer start.

Structural/Logistical Matters

I tell students what “interactive grading” is at the start of the course and I tell them which pieces of their work will be graded interactively. In most courses (in all of my university courses for credit), I tell them that this is a mandatory activity.

Some students choose not to participate in interactive grading sessions. Until now, I have tolerated this and graded their work in the traditional way. However, no one has ever given me a good reason for avoiding these (and I have run into several bad ones). The larger problem for me is that later in the course, when I want to do something that assumes that every student has had the interactive grading experience, it is inappropriate for students who skipped it. In the future, in academic courses, I will reinforce the “mandatory” nature of the activity by assigning a grade of zero on the work to a student who (after being warned of this) chooses not to schedule a grading meeting.

Scheduling the meetings is a challenge. To simplify it, I suggest a Doodle poll (www.doodle.com). Post the times you can be available and let each student sign up for a time (from your list) that s/he is available.

I schedule the meetings for 1.5 hours. They often end at 1 hour. Some drag on. If the meeting’s going to run beyond 2 hours, I usually force the meeting to a conclusion. If I think there will be enough value for the student, I offer to schedule a follow-up meeting to go over the rest of the work. If not, then either I announce the grade at the end of the session or I tell the student that I will grade the rest of the work in the traditional way and get back to them with the total.

During the meeting, the student sits across a desk from me. I use a computer with 3 monitors. Two of the monitors show the same information. I look at one and turn the other toward the student. Thus, the student and I can see the same things without having to crowd together to look over each other’s shoulder at one screen. The third screen faces me. I use it to look at any information that I don’t want to share with the student. I use 27″ monitors (small enough to fit on my desk, big enough to be readable for most students) with 1920×1080 resolution. This easily fits two readable word-processing windows side-by-side, such as the student’s submitted work in one window and the grading guide in the next window.

Example 1: Midterm Exams

In some of my courses, I give the students a list of questions well before the exam and draw the exam questions from the list. In my Software Testing 1 course, for example (the “Black Box Testing Course”), I give students a 100-question subset of this list: http://www.testingeducation.org/BBST/takingexams/ExamEssayQuestions2010.pdf

I outlined the costs and benefits of this approach at WTST 2003, in: http://kaner.com/pdfs/AssessmentTestingCourse.pdf. For our purposes, the most important benefit is that students have time before the exam to prepare an answer to each question. They can’t consult their prepared answers during the actual exam, but this lets them come to the exam well-prepared, with a clear idea of what the question means, how to organize the answer, and what points they want to make in it.

I typically give 2 or 3 midterms in a course and I am typically willing to drop the worst one. Thus the student can do poorly on the first midterm, use that experience to learn how to improve their study strategy and writing, and then do better on the next one(s). I do the interactive grading with the first midterm.

Students sometimes submit exams on paper (traditional, handwritten supervised exam), sometimes in a word processor file (supervised exam where students type at a university-owned computer in a university class/exam-room), sometimes in a word processor file (unsupervised take-home exam). For our purposes, assume that the student submitted an electronic exam.

Before I start grading any student’s work, I prepare a grading guide that identifies the types of information that I expect to see in the answer and the points that are possible for that particular type of info. If you’ve never seen that type of grading structure, look at my slides and videos (“How we grade exams”) at http://www.testingeducation.org/BBST/takingexams/.

Many of my questions are (intentionally) subject to some interpretation. Different students can answer them differently, even reaching contradictory conclusions or covering different technical information, but earn full points. The grading guide will allow for this, offering points for several different clusters of information instead of showing only One True Answer.

During the meeting, I rely frequently on the following documents, which I drag on and off of the shared display:

  • The student’s exam
  • The grading guide for that exam
  • The set of all of the course slides
  • The transcripts of the (videotaped) lectures (some or all of my lectures are available to the students on video, rather than given live)
  • The assigned papers, if any, that are directly relevant to exam questions.

We work through the exam one question at a time. The initial display is a copy of the exam question and a copy of the student’s answer. I skim it, often running my mouse pointer over the document to show where I am reading. If I stop to focus, I might select a block of text with the mouse, to show what I am working on now.

  • If the answer is well done, I’ll just say “good” and announce the grade (“That’s a 10 out of 10”). Then I skip to the next answer.
  • If the answer is confusing or sometimes if it is incomplete, I might start the discussion by saying to the student, “Tell me about this.” Without having seen my grading guide, the student tells me about the answer. I ask follow-up questions. For example, sometimes the student makes relevant points that aren’t in the answer itself. I tell the student it’s a good point and ask where that idea is in the written answer. I listed many of my other questions near the start of this report.
  • At some point, often early in the discussion, I display my grading guide beside the student’s answer, explain what points I was looking for, and either identify things that are missing or wrong in the student’s answer or ask the student to map their answer to the guide.
    • Typically, this makes it clear that the point is not in the answer, and what the grading cost is for that.
    • Some students try to haggle with the grading.
      • Some simply beg for more points. I ask them to justify the higher grade by showing how their answer provides the information that the grading guide says is required or creditable for this question.
      • Some tell me that I should infer from what they did write that they must have known the other points that they didn’t write about and so I should give them credit. I explain that I can only grade what they say, not what they don’t say but I think they know anyway.
      • Some agree that their answer as written is incomplete or unclear but they show in this meeting’s discussion that they understand the material much better than their answer suggests. I often give them additional credit, but we’ll talk about why it is that they missed writing down that part of the answer, and how they would structure an answer to a question like this in the future so that their performance on the next exam is better.
      • Some students argue with the analysis I present in the guide. Usually this doesn’t work, but sometimes I give strong points for an analysis that is different from mine but justifiable. I might add their analysis to the grading guide as another path to high points for that answer.
  • The student might claim that I am expecting the student to know a specific detail (e.g. definition or fact) that wasn’t taught in the course. This is when I search the slides and lecture transcripts and readings, highlighting the various places that the required information appeared. I try to lead from here to a different discussion–How did you miss this? What is the hole in your study strategy?
  • Sometimes at the end of the discussion, I ask the student to go to the whiteboard and present a good answer to the question. I do this when I think it will help the student tie together the ideas we’ve discussed, and from doing that, see how to structure answers better in the future. A good presentation (many of them are good enough) might take the assigned grade for the question from a low grade to a higher one. I might say to the student, “That’s a good analysis. Your answer on paper was worth 3/10. I’m going to record a 7/10 to reflect how much better a job you can actually do, but next time we won’t have interactive grading so you’ll have to show me this quality in what you write, not in the meeting. If you give an answer this good on the next exam, you’ll get a 9 or 10.”

In a 1.5 hour meeting, we can only spend a few minutes on each question. A long discussion for a single question runs 20 minutes. One of my tasks is to move the discussion along.

Students often show the same weakness in question after question. Rather than working through the same thing in detail each time, I’ll simply note it. As a common example, I might say to a student

Here’s another case where you answered only 2 of the 3 parts of the question. I think you need to make a habit of writing an outline of your answer, checking that the outline covers every part of the question, and then filling in the outline.

Other students are simply unprepared for the exam and most of their answers are simply light on knowledge. Once it’s clear that lack of preparation was the problem (often it becomes clear because the student tells me that was the problem), I speed up the meeting, looking for things to ask or say that might add value (for example complimenting a good structure, even though it is short on details). There is no value in dragging out the meeting. The student knows the work was bad. The grade will be bad. Any time that doesn’t add value will feel like scolding or punishment, rather than instruction.

The goal of the meeting is constructive. I am trying to teach the student how to write the next exam better. We might talk about how the student studied, how the student used peer review of draft answers, how the student outlined or wrote the answer, how the student resolved ambiguities in the question or in the course material — and how the student might do this differently next time.

Especially if the student achieved a weak grade, I remind the student that this is the first of three midterms and that the course grade is based on the best two. So far, the student has lost nothing. If they can do the next two well, they can get a stellar grade. For many students, this is a pleasant and reassuring way to conclude the meeting.

The statistical results of this are unimpressive. For example, in a recent offering of the course, 11 students completed it.

  • On midterm 1 (interactively graded) their average grade was 76.7.
  • On midterm 2, the average grade was 81.5
  • On midterm 3, the average grade was 76.3
  • On the final exam, the average grade was 77.8

Remember that I gave students added credit for their oral presentation during interactive grading, which probably added about 10 points to the average grade for midterm 1. Also, I think I grade a little more strictly toward the end of the course and so a B (8/10) answer for midterm 2 might be a B- (7.5) for midterm 3 or the final. Therefore, even though these numbers are flat, my subjective impression of the underlying performance was that it was improving, but this was not a powerful trend.

At the end of the meetings, I asked students whether they felt this was a good use of their time and whether they felt it had helped them. They all told me that it did. In another class (metrics), some students who had gone through interactive grading in the testing course asked for interactive grading of the first metrics midterm. This seemed to be another indicator that they thought the meetings had been helpful and sufficiently pleasant experiences.

This is not a silver bullet, but the students and I feel that it was helpful.

Example 2: Practical Assignments

In my software testing class, I assign tasks that people would do in actual practice. Here are two examples that we have taken to interactive grading:

  1. The student joins the OpenOffice (OOo) project (https://blogs.apache.org/OOo/entry/you_can_help_us_improve) and reviews unconfirmed bug reports. An unconfirmed bug has been submitted but not yet replicated. The student tries to replicate it and adds notes to the report, perhaps providing a simpler set of steps to get to the failure or information that the failure shows up only on a specific configuration or with specific data. The student also writes a separate report to our class, evaluating the communication quality and the technical quality of the original report. An example of this assignment is here: http://www.testingeducation.org/BBST/bugadvocacy/AssignmentBugEvaluationv11.3.pdf
  2. The student picks a single variable in OpenOffice Writer (such as the number of rows in a table) and does a domain analysis of it. In the process, the student imagines several (about 20) ways the program could fail as a consequence of an attempt to assign a value to the variable or an attempt to use the variable once that value has been assigned. For each risk (way the program could fail), the student divides the values of the variable into equivalence classes (all the values within the same class should cause the test with that variable and that value to behave the same way) and then decides which one test should be used from each class (typically a boundary value). An example of this assignment is here: http://www.testingeducation.org/BBST/testdesign/AssignmentRiskDomainTestingFall2011.pdf

The Bug Assignment

When I meet with the student about the bug assignment, we review the bug report (and the student’s additions to it and evaluation of it). I start by asking the student to tell me about the bug report. In the discussion, I have several follow-up questions, such as

  • how they tried to replicate the report
  • why they stopped testing when they did
  • why they tested on the configurations they did
  • what happened when they looked for similar (often equivalent) bugs in the OOo bug database and whether they learned anything from those other reports,

In many cases, especially if the student’s description is a little confusing, I will bring up OpenOffice and try to replicate the bug myself. Everything I do is on the shared screen. They see what I do while I give a running commentary.

  • Sometimes I ask them to walk me through the bug. They tell me what to do. I type what they say. Eventually, they realize that the instructions they wrote into the bug report aren’t as clear or as accurate as they thought.
  • Sometimes I try variations on the steps they tried, especially if they failed to replicate the bug.

My commentary might include anecdotes about things that have happened (that I saw or did) at real companies or comments/demonstration of two ways to do essentially the same thing, with the second being simpler or more effective.

When I ask students about similar bugs in the database, most don’t know how to do a good search, so I show them. Then we look at the reports we found and decide whether any are actually relevant.

I also ask the student how the program should work and why they think so. What did they do to learn more about how the program should work? I might search for specifications or try other programs and see what they do.

I will also comment on the clarity and tone of the comments the student added to the OOo bug report, asking the student why they said something a certain way or why they included some details and left out others.

Overall, I am providing different types of feedback:

  • Did the student actually do the work that was assigned? Often they don’t do the whole thing. Sometimes they miss critical tasks. Sometimes they misunderstand the task.
  • Did the student do the work well? Bug reporting (and therefore this assignment) involves a mixture of persuasive technical writing and technical troubleshooting.
    • How well did they communicate? How much did they improve the communication of the original report? How well did they evaluate the communication quality of the original report?
    • How well did they troubleshoot? What types of information did they look for? What other types could they have looked for that would probably have been helpful? What parameters did they manipulate and how wise were their choices of values for those parameters?

Thinking again about the distinction between competence and performance,

  • Some students show performance problems (for example, they do the task poorly because they habitually follow instructions sloppily). For these students, the main feedback is about their performance problems.
  • Some students show competence problems–they are good at following instructions and they can write English sentences, but they need to learn how to do a better job of bug reporting. For these students, the main feedback is what they are already doing well and what professional-quality work of this type looks like.

This is a 4-phase assignment. I try to schedule the interactive grading sessions soon after Phase 1, because Phase 3 is a more sophisticated repetition of Phase 1 (similar task on a different bug report). The ideal case is Phase 1 — Feedback — Phase 3. Students who were able to schedule the sessions this way have told me that the feedback helped them do a much better job on Phase 3.

The Domain Testing Analysis

Every test technique is like a lens that you look through to see the program. It brings some aspects of the program into clear focus, and you test those in a certain way. It pretty much makes the other aspects of the program invisible.

In domain testing, you see a world of variables. Each variable stands out as a distinct individual. Each variable has sets of possible values:

  • The set of values that users might try to assign to the variable (the values you might enter into a dialog box, for example). Some of these values are invalid — the variable is not supposed to take on these values and the program should reject them.
  • The set of values that the variable might actually take on — other parts of the program will use this variable and so it is interesting to see whether this variable can take on any (“valid”) values that these parts of the program can’t actually handle.
  • The set of values that might be output, when this variable is displayed, printed, saved to disk, etc.

These sets overlap, but it is useful to recognize that they often don’t map perfectly onto each other. A test of a specific “invalid” value might be sometimes useful, sometimes uninteresting and sometimes impossible.

For most variables, each of these sets is large and so you would not want to test every value in the set. Instead, domain testing has you group values as equivalent (they will lead to the same test result) and then sample only one or two values from every set of equivalents. This is also called equivalence-class analysis and boundary testing.
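
For a concrete (and hypothetical) illustration, suppose the number of rows in a table must be between 1 and 64. The bookkeeping for a domain analysis of that one variable might look like this sketch:

    # Equivalence classes and the boundary values chosen to represent them.
    # The 1..64 range is an assumption made up for this illustration.
    ROWS_CLASSES = [
        # (class description,       representative test values)
        ("below the valid range",   [0]),        # invalid; the program should reject it
        ("valid range 1..64",       [1, 64]),    # lower and upper boundaries
        ("above the valid range",   [65]),       # invalid; the program should reject it
    ]

    def tests_for_variable(classes):
        """One test per representative value, instead of one per possible value."""
        return [(desc, value) for desc, values in classes for value in values]

    print(tests_for_variable(ROWS_CLASSES))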

One of the most challenging aspects of this task is adopting the narrow focus of the technique. People aren’t used to doing this. Even working professionals — even very skilled and experienced working professionals — may not be used to doing this and might find it very hard to do. (Some did find it very hard, in the BBST:Test Design course that I taught through the Association for Software Testing and in some private corporate classes.)

Imagine an assignment that asks you to generate 15 good tests that are all derived from the same technique. Suppose you imagine a very powerful test (or a test that is interesting for some other reason) that tests that variable of that program (the variable you are focusing on). Is it a good test? Yes. Is it a domain test? Maybe not. If not, then the assignment is telling you to ignore some perfectly good tests that involve the designated variable, maybe in order to generate other tests that look more boring or less powerful. Some people find this confusing. But this is why there are so many test techniques (BBST: Test Design catalogs over 100). Each technique is better for some things and worse for others. If you are trying to learn a specific technique (you can practice a different one tomorrow), then tests that are not generated by that technique are irrelevant to your learning, no matter how good they are in the general scheme of things.

We run into a conflict of intuitions in the practitioner community here. The view that I hold, that you see in BBST, is that the path to high-skill testing is through learning many different techniques at a high level of skill. To generate a diverse collection of good tests, use a diverse set of techniques. To generate a few tests that are optimized for a specific goal, use a technique that is appropriate for that goal. Different goals, different techniques. But to do this, you have to apply a mental discipline while learning. You have to narrow your focus and ask, “If I was a diehard domain tester who had no use for any other technique, how would I analyze this situation and generate my next tests?” Some people have told me this feels narrow-minded, that it is more important to train testers to create good tests and to get in touch with their inner creativity, than to cramp their style with a narrow vision of testing.

As I see it, this type of work doesn’t make you narrow-minded. It doesn’t stop you from using other techniques when you actually do testing at work. This is not your testing at work. It is your practice, to get good enough to be really good at work. Think of practicing baseball. When you are in batting practice, trying to improve your hitting, you don’t do it by playing catch (throwing and catching the ball), not even if it is a very challenging round of catch. That might be good practice, but it is not good batting practice.

The domain testing technique helps you select a few optimal test values from a much larger set of possibilities. That’s what it’s good for. We use it to pick the most useful specific values to test for a given variable. Here’s a heuristic: If you design a test that would probably yield the same result (pass/fail) no matter what value is in the variable, you are probably focused on testing a feature rather than a variable, and you are almost certainly not doing domain testing.
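Here is a rough illustration of that heuristic, again with invented details (a hypothetical save_settings function and a made-up 6-to-72 range). The first test would pass for any in-range value, so it exercises the feature rather than the variable; the second cares exactly which values are used, which is the mark of a domain test.

```python
# Two illustrative tests of a hypothetical settings function.
# The function, the range, and the values are invented for the example.

def save_settings(font_size: int) -> dict:
    """Hypothetical feature under test: stores a font size of 6..72."""
    if not (6 <= font_size <= 72):
        raise ValueError("font size out of range")
    return {"font_size": font_size}

def test_save_returns_a_dict():
    # Feature-focused: this would pass no matter which in-range value
    # we picked, so it says nothing about which values are worth testing.
    assert isinstance(save_settings(12), dict)

def test_upper_boundary_of_font_size():
    # Domain test: the specific values 72 and 73 matter, because they sit
    # on either side of the boundary between equivalence classes.
    assert save_settings(72)["font_size"] == 72
    try:
        save_settings(73)
    except ValueError:
        pass
    else:
        raise AssertionError("expected 73 to be rejected")

if __name__ == "__main__":
    test_save_returns_a_dict()
    test_upper_boundary_of_font_size()
    print("both illustrative tests ran")
```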

Keeping in mind the distinction between competence and performance,

  • Some students show performance problems (for example, they do the task poorly because they habitually follow instructions sloppily). For these students, the main feedback is about their performance problems.
  • The students who show competence problems are generally having trouble with the idea of looking at the world through the lens of a single technique. I don’t have a formula for dealing with these students. It takes individualized questioning and example-creation, and even that doesn’t always work. Here are some of the types of questions:
    • Does this test depend on the value of the variable? Does the specific value matter? If not, then why do we care which values of the variable we test with? Why do a domain analysis for this?
    • Does this test mainly depend on the value of this variable or the value of some other variable?
    • What parts of the program would care if this variable had this value instead of that value?
    • What makes this specific value of the variable better for testing than the others?

From my perspective as the teacher, these are the hardest interactive grading discussions.

  • The instructions are very detailed, but they are a magnet for performance problems that dominate the discussions with the weaker students.
  • The technique is not terribly hard, if you are willing to let yourself apply it in a straightforward way. The problem is that people are not used to explicitly using a cognitive lens, which is essentially what a test technique is.

Student feedback on this has been polite but mixed. Many students are enthusiastic about interactive grading and (say they) feel that they finally understood what I was talking about after the discussion that applied it to their assignment. Other students came away feeling that I had an agenda, or that I was inflexible.

Example 3: Research Essays

In my Software Metrics course, I require students to write two essays. In each essay, the student is required to select one (1) software metric and to review it. I give students an outline for the essay that has 28 sections and subsections, requiring them to analyze the validity and utility of the metric from several angles. They are to look in the research literature to find information for each section, and if they cannot find it, to report how they searched for that information (what search terms in which electronic databases) and summarize the results. Then they are to extrapolate from the other information they have learned to speculate what the right information probably is.

These are 4th year undergraduates or graduate students. A large percentage of these students lack basic library skills. They do not know how to do focused searches in electronic databases for scholarly information and they do not know how to assess its credibility or deal with conflicting results and conflicting conclusions. Many of them lack basic skills in citing references. Few are skilled at structuring a paper longer than 2 or 3 pages. I am describing good students at a well-respected American university that is plenty hard to get into. We have forced them to take compulsory courses in writing, but those courses only went so far and the students only paid so much attention. My understanding, having talked at length with faculty at other schools, is that this is typical of American computer science students.

My primary goal is to help students learn how to cut through the crap written about most metrics (wild claims pro and con, combined with an almost-shocking lack of basic information) so that they can do an analysis as needed on the job.

Metrics are important. They are necessary for management of projects and groups. Working with metrics will be demanded of most people who want to rise beyond junior-level manager or mid-level programmer and of plenty of people whose careers will dead-end below that level. But badly-used metrics can do more harm than good. Given that most (or all) software metrics are poorly researched, have serious problems of validity (or at least have little or no supporting evidence of validity), and have serious risk of causing side-effects (measurement dysfunction) to the organization that uses them, there is no simple answer to what basket of metrics should be used for a given context. On the other hand, we can look at metrics as imperfect tools. People can be pretty good at finding information if they understand what they are looking for and they understand the strengths and weaknesses of their tools. And they can be pretty good at limiting the risks of risky things. So rather than encouraging my students to adopt a simplistic, self-destructive (or worse, pseudo-moralistic) attitude of rejecting metrics, I push them to learn their tools, their limits, their risks, and some ways to mitigate risk.

My secondary goal is to deal with the serious performance problems that these students have with writing essays.

I assign two essays so that they can do one, get detailed feedback via interactive grading, and then do another. I have done this in one (1) course.

Before the first essay was due, we had a special class with a reference librarian who gave a presentation on finding software-metrics research literature in the online databases of Florida Tech’s library and we had several class discussions on the essay requirements.

The results were statistically unimpressive: the average grade went from 74.0 to 74.4. Underneath the numbers, though, is the reality that I enforced a much higher standard on the second essay. Most students did better work on Essay 2, often substantially.

The process for interactive grading was a little different. I kept the papers for a week before starting interactive grading, so that I could check them for plagiarism. I use a variety of techniques for this (if you are curious, see the video course on plagiarism-detection at http://www.testingeducation.org/BBST/engethics/). I do not do interactive grading sessions with plagiarists. The meeting with them is a disciplinary meeting and the grade is zero.

Some students skirted the plagiarism line, some intentionally and others not. In the interactive grading session, I made a point of raising this issue, giving these students feedback on what I saw, how it could be interpreted and how/why to avoid it in the future.

For me, doing the plagiarism check first is a necessary prerequisite to grading an essay. That way, I have already put behind me the question of whether this is actually the student’s work. From here, I can focus on what the student did well and how to improve it.

A plagiarism check is not content-focused. I don’t actually read the essay (or at least, I don’t note or evaluate its content, I ignore its structure, I don’t care about its style apart from looking for markers of copying, and I don’t pay attention to who wrote it). I just hunt for a specific class of possible problems. Thus, when I meet with the student for interactive grading, I am still reading the paper for the first time.

During the interactive grading session, I ask the student to tell me about the metric.

One of the requirements that I impose on the students (most don’t do it) is that they apply the metric to real code that they know, so that they can see how it works. As computer science students, they have code, so if they understand the metric, they can do this. I ask them what they did and what they found. In the rare case that the student has actually done this, I ask whether they did any experimenting, changing the code a little to see its effect on the metric. With some students, this leads to a good discussion about investigating your tools. With the students who didn’t do it, I remind them that I will be unforgiving about this in Essay 2. (Astonishingly, most students still didn’t do this in Essay 2. This cost each of them a one-letter-grade (10-point) penalty.)
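As an aside, the experiment I am asking for can be quite small. Here is a sketch of the flavor of it, with everything invented for illustration: a crude decision-point count standing in for a real complexity metric, applied to two versions of a toy function so you can watch the number move when the code changes.

```python
# A toy experiment of the kind I ask students to run: apply a metric to
# code you know, change the code a little, and see how the number moves.
# The metric here is a crude decision-point count (a rough stand-in for
# cyclomatic complexity); the code samples are invented.

import ast

def decision_points(source: str) -> int:
    """Count branching constructs as a rough complexity measure."""
    tree = ast.parse(source)
    branch_nodes = (ast.If, ast.For, ast.While, ast.BoolOp, ast.ExceptHandler)
    return sum(isinstance(node, branch_nodes) for node in ast.walk(tree))

original = """
def classify(score):
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    else:
        return "F"
"""

# The same behavior, restructured around a lookup loop.
revised = """
def classify(score):
    for cutoff, grade in [(90, "A"), (80, "B"), (70, "C")]:
        if score >= cutoff:
            return grade
    return "F"
"""

print("original:", decision_points(original))  # each elif parses as a nested If
print("revised: ", decision_points(revised))   # one For plus one If
```

The interesting conversation is not about which number is “right” but about what the change in the number does and does not tell you about the code.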

From there, I typically read the paper section by section, highlighting the section that I am reading so that the student can follow my progress by looking at the display. Within about half a page of reading, I make comments (like, “this is really well-researched”) or ask questions (like, “this source must have been hard to find; what led you to it?”).

The students learned some library skills in the meeting with the librarian. The undergrads have also taken a 1-credit library skills course. But most of them have little experience. So at various points, it is clear that the student has not been able to find any (or much) relevant information. This is sometimes an opportunity for me to tell the student how I would search for this type of information, and then demonstrate that search. Sometimes it works; sometimes I find nothing useful either.

In many cases, the student finds only the cheerleading of a few people who created a metric or who enthuse about it in their consulting practices. In those reports, all is sunny and light. People might have some problems trying to use the metric, but overall (these reports say), someone who is skilled in its use can avoid those problems and rely on it.

In the discussion, I ask skeptical questions. I demonstrate searches for more critical commentary on the metric. I point out that many of the claims by the cheerleaders lack data. They often present summaries of not-well-described experiments or case studies. They often present tables with numbers that came from not-exactly-clear-where. They often present reassuring experience reports. Experience reports (like the one you are reading right now) are useful. But they don’t have a lot of evidence-value. A well-written experience report should encourage you to try something yourself, giving you tips on how to do that, and encourage you to compare your experiences with the reporter’s. For experience reports to provide enough evidence of some claim to let you draw some conclusions about it, I suggest that you should require several reports that came from people not associated with each other and that made essentially the same claim, or had essentially the same thing/problem arise, or reached the same conclusion. I also suggest that you should look for counter-examples.

During the meeting, I might make this same point and then hunt for more experience reports, probably in parallel with the student who is on their own computer while I run the search on mine. This type of attack on the data foundation of the metric is a bit familiar to these students because we use Bossavit’s book The Leprechauns of Software Engineering: How folklore turns into fact and what to do about it, as one of our texts.

In the essay, all students have performance problems (that is, all students have problems with the generic task of writing essays and the generic task of finding information) and so (in my extensive experience — 1 course), the discussion with all students switches back and forth between the content (the analysis of the metric and of the research data) and the generic parts of the assignment.

A Few More Thoughts

You can use interactive grading with anything that you can review. For example, Keith Gallagher uses interactive grading in his programming courses, reading the student’s code with them and asking questions about its architecture, style, and syntax, and about whether the program actually provides the benefits it is supposed to provide. His students have spoken enthusiastically about this to me.

Dr. Gallagher ends his sessions a little differently than I do. Rather than telling students what their grade is, he asks them what they think their grade should be. In the discussion that follows, he requires the student to view the work from his perspective, justifying their evaluation in terms of his grading structure for the work. It takes some skill to pull this off, but done well, it can give a student who is disappointed with their grade even more insight into what needs improvement.

Done with a little less skill, interactive grading can easily become an interaction that is uncomfortable or unpleasant for the student. It can feel to a student who did poorly as if the instructor is bullying them. It can be an interaction in which the instructor does all the talking. If you want to achieve your objectives, you have to intentionally manage the tone of the meeting so that it serves them.

Overall, I’m pleased with my decision to introduce interactive grading to my classes. As far as I can tell, interactive grading takes about the same amount of time as more traditional grading approaches. There are several benefits to students. Those with writing or language problems have the opportunity to demonstrate more skill or knowledge than their written work might otherwise suggest. They get highly personalized, interactive coaching that seems to help them submit better work on future assignments. Most students like interactive grading and consider it a worthwhile use of their time. Finally, I think it is important to have students feel they are getting a good value for their tuition dollar. This level of constructive personal attention contributes to that feeling.

This post is partially based on work supported by NSF research grant CCLI-0717613, “Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing.” Any opinions, findings and conclusions or recommendations expressed in this post are those of the author and do not necessarily reflect the views of the National Science Foundation.