Updating some core concepts in software testing

November 21st, 2006

Most software testing techniques were first developed in the 1970’s, when “large� programs were tiny compared to today.

Programmer productivity has grown dramatically over the years, a result of paradigmatic shifts in software development practice. Testing practice has evolved less dramatically and our productivity has grown less spectacularly. This divergence in productivity has profound implications—every year, testers impact less of the product. If we continue on this trajectory, our work will become irrelevant because its impact will be insignificant.

Over the past few years, several training organizations have created tester certifications. I don’t object in principle to certification but the Body of Knowledge (BoK) underlying a certificate has broader implications. People look to BoKs as descriptions of good (or at least current) attitudes and practice.

I’ve been dismayed by the extent to which several BoKs reiterate the 1980’s. Have we really made so little progress?

When we teach the same basics that we learned, we provide little foundation for improvement. Rather than setting up the next generation to rebel against the same dumb ideas we worked around, we should teach our best alternatives, so that new testers’ rebellions will take them beyond what we have achieved.

One popular source of orthodoxy is my own book, Testing Computer Software (TCS). In this article, I would like to highlight a few of the still-influential assumptions or assertions in TCS, in order to reject them. They are out of date. We should stop relying on them.

Where TCS was different

I wrote TCS to highlight what I saw as best practices (of the 1980’s) in Silicon Valley, which were at odds with much of the received wisdom of the time:

  • Testers must be able to test well without authoritative (complete, trustworthy) specifications. I coined the phrase, exploratory testing, to describe a survival skill.
  • Testing should address all areas of potential customer dissatisfaction, not just functional bugs. Because matters of usability, performance, localizability, supportability, (these days, security) are critical factors in the acceptability of the product, test groups should become skilled at dealing with them. Just because something is beyond your current skill set doesn’t mean it’s beyond your current scope of responsibility.
  • It is neither uncommon nor unethical to defer (choose not to fix) known bugs. However, testers should research a bug or design weakness thoroughly enough and present it carefully enough to help the project team clearly understand the potential consequences of shipping with this bug.
  • Testers are not the primary advocates of quality. We provide a quality assistance service to a broader group of stakeholders who take as much pride in their work as we do.
  • The decision to automate a test is a matter of economics, not principle. It is profitable to automate a test (including paying the maintenance costs as the program evolves) if you would run the manual test so many times that the net cost of automation is less than manual execution. Many manual tests are not worth automating because they provide information that we don’t need to collect repeatedly.
  • Testers must be able to operate effectively within any software development lifecycle—the choice of lifecycle belongs to the project manager, not the test manager. In addition, the waterfall model so often advocated by testing consultants might be a poor choice for testers because the waterfall pushes everyone to lock down decisions long before vital information is in, creating both bad decisions and resistance to later improvement.
  • Testers should design new tests throughout the project, even after feature freeze. As long as we keep learning about the product and its risks, we should be creating new tests. The issue is not whether it is fair to the project team to add new tests late in the project. The issue is whether the bugs those tests could find will impact the customer.
  • We cannot measure the thoroughness of testing by computing simple coverage metrics or by creating at least one test per requirement or specification assertion. Thoroughness of testing means thoroughness of mitigation of risk. Every different way that the program could fail creates a role for another test.

The popularity of these positions (and the ones they challenged) waxes and wanes, but at least they are seen as mainstream points of view.

Where TCS 3 would (will) be different

TCS Editions 1 and 2 were written in a moderate tone. In retrospect, my wording was sometimes so gentle that readers missed key points. In addition, some of TCS 2’s firm positions were simply mistaken:

  • It is not the primary purpose of testing to find bugs. (Nor is it the primary purpose of testing to help the project manager make decisions.) Testing is an empirical investigation conducted to provide stakeholders with information about the quality of the software under test. Stakeholders have different informational needs at different times, in different situations. The primary purpose of testing is to help those stakeholders gain the information they need.
  • Testers should not attempt to specify the expected result of every test. The orthodox view is that test cases must include expected results. There are many stories of bugs missed because the tester simply didn’t recognize the failure. I’ve seen this too. However, I have also seen cases in which testers missed bugs because they were too focused on verifying “expectedâ€? results to notice a failure the test had not been designed to address. You cannot specify all the results—all the behaviors and system/software/data changes—that can arise from a test. There is value in documenting the intent of a test, including results or behaviors to look for, but it is important to do so in a way that keeps the tester thinking and scanning for other results of the test instead of viewing the testing goal as verification against what is written.
  • Procedural documentation probably offers little training value. I used to believe testers would learn the product by following test scripts or by testing user documentation keystroke by keystroke. Some people do learn this way, but others (maybe most) learn more from designing / running their own experiments than from following instructions. In science education, we talk about this in terms of the value of constructivist and inquiry-based learning. There’s an important corollary to this that I’ve learned the hard way—when you create a test script and pass it to an inexperienced tester, she might be able to follow the steps you intended, but she won’t have the observational skills or insights that you would have if you were following the script instead. Scripts might create a sequence of actions but they don’t create cognition.
  • Software testing is more like design evaluation than manufacturing quality control. A manufacturing defect appears in an individual instance of a product (like badly wired brakes in a car). It makes sense to look at every instance in the same ways (regression tests) because any one might fail in a given way, even if the one before and the one after did not. In contrast, a design defect appears in every instance of the product. The challenge of design QC is to understand the full range of implications of the design, not to look for the same problem over and over.
  • Testers should not try to design all tests for reuse as regression tests. After they’ve been run a few times, a regression suite’s tests have one thing in common: the program has passed them all. In terms of information value, they might have offered new data and insights long ago, but now they’re just a bunch of tired old tests in a convenient-to-reuse heap. Sometimes (think of build verification testing), it’s useful to have a cheap heap of reusable tests. But we need other tests that help us understand the design, assess the implications of a weakness, or explore an issue by machine that would be much harder to explore by hand. These often provide their value the first time they are run—reusability is irrelevant and should not influence the design or decision to develop these tests.
  • Exploratory testing is an approach to testing, not a test technique. In scripted testing, a probably-senior tester designs tests early in the testing process and delegates them to programmers to automate or junior testers to run by hand. In contrast, the exploratory tester continually optimizes the value of her work by treating test-related learning, test design, test execution and test result interpretation as mutually supportive activities that run in parallel throughout the project. Exploration can be manual or automated. Explorers might or might not keep detailed records of their work or create extensive artifacts (e.g. databases of sample data or failure mode lists) to improve their efficiency. The key difference between scripting and exploration is cognitive—the scripted tester follows instructions; the explorer reinvents instructions as she stretches her knowledge base and imagination.
  • The focus of system testing should shift to reflect the strengths of programmers’ tests. Many testing books (including TCS 2) treat domain testing (boundary / equivalence analysis) as the primary system testing technique. To the extent that it teaches us to do risk-optimized stratified sampling whenever we deal with a large space of tests, domain testing offers powerful guidance. But the specific technique—checking single variables and combinations at their edge values—is often handled well in unit and low-level integration tests. These are much more efficient than system tests. If the programmers are actually testing this way, then system testers should focus on other risks and other techniques. When other people are doing an honest and serious job of testing in their way, a system test group so jealous of its independence that it refuses to consider what has been done by others is bound to waste time repeating simple tests and thereby miss opportunities to try more complex tests focused on harder-to-assess risks.
  • Test groups should offer diverse, collaborating specialists. Test groups need people who understand the application under test, the technical environment in which it will run (and the associated risks), the market (and their expectations, demands, and support needs), the architecture and mechanics of tools to support the testing effort, and the underlying implementation of the code. You cannot find all this in any one person. You can build a group of strikingly different people, encourage them to collaborate and cross-train, and assign them to project areas that need what they know.
  • Testers may or may not work best in test groups. If you work in a test group, you probably get more testing training, more skilled criticism of your tests and reports, more attention to your test-related career path, and stronger moral support if you speak unwelcome truths to power. If you work in an integrated development group, you probably get more insight into the development of the product, more skilled criticism of the impact of your work, more attention to your broad technical career path, more cross-training with programmers, and less respect if you know lots about the application or its risks but little about how to write code. If you work in a marketing (customer-focused) group, you probably get more training in the application domain and in the evaluation of product acceptability and customer-oriented quality costs (such as support costs and lost sales), more attention to a management-directed career path, and more sympathy if programmers belittle you for thinking more like a customer than a programmer. Similarly, even if there is a cohesive test group, its character may depend on whether it reports to an executive focused on testing, support, marketing, programming, or something else. There is no steady-state best place for a test group. Each choice has costs and benefits. The best choice might be a fundamental reorganization every two years to diversify the perspectives of the long-term staff and the people who work with them.
  • We should abandon the idea, and the hype, of best practices. Every assertion that I’ve made here has been a reaction to another that is incompatible but has been popularly accepted. Testers provide investigative services to people who need information. Depending on the state of their project, the ways in which the product is being developed, and the types of information the people need, different practices will be more appropriate, more efficient, more conducive to good relations with others, more likely to yield the information sought—or less.

This paper has been a summary of a talk I gave at KWSQA last month and was written for publication in their newsletter. For additional details, see my paper, The Ongoing Revolution in Software Testing, available at www.kaner.com/pdfs/TheOngoingRevolution.pdf.

A new five-year plan

July 28th, 2006

I manage my career (and much of the rest of my life) with 5-year plans. This one (my sixth, I think) is a bit late. I adopted my last one in 1999 so the next was due in 2004. It took longer to converge than I expected but, at last, I think it’s done.

The 2006 plan is an incremental shift from 1999.

  • Since 1999, I’ve developed a new instructional model for teaching that seems to work well at school and might transfer well back to the workplace.

I think this can be a foundation for an open courseware curriculum in software testing that people can self-study, study informally with peers or as a workgroup on the job, or in guided online courses. I think we can build an open courseware community to support, develop and facilitate courses like these.

  • Since 1999, the biggest innovation in commercial software testing education has been in marketing–more people are using “certification” to sell courses, sometimes at stunningly high fees. Pay a lot of money, take a short course that helps you learn how to pass the test, pay for a test that some organization with a fancy name approves (maybe an organization that was created for this purpose) and get a fancy certificate that some employers would treat as authoritative evidence of competence in testing. I went to some meetings of trainers about this, read a lot of email, decided that the schemes that I saw didn’t have much merit but probably wouldn’t do much harm either. But they’ve been making a lot of money, and with that, they’ve gotten a lot better at marketing. Hey, Villanova University even offers a MASTER CERTIFICATE IN SOFTWARE TESTING. It looks pretty official, from a university that emphasizes its accredition on the same page (it is accredited, but I doubt that the Certificate Programs fall within the scope of the academic accrediting reviews). Look closely and this Master Certification is just a couple of short courses that lead to an industry certification.
    • This is great marketing. But what does it do for the field?
    • Do you really achieve competence in testing by taking a couple of commercial short courses in testing and passing a test?
    • Is competent testing really so trivial that you can develop competence in only a couple of commercial short courses?
    • Can you develop skill in these courses, or measure skill in these tests? How could you? If not, why should we treat these as indicators of competence in a skilled field?

I think it’s time to create a new model for certification in the field that focuses more on the benefit to the community and the advancement of the field and less on the marketing advantages. Some colleagues and I have started a project to create Open Certification for software testing.

  • One of the cool ideas of the 1990’s was test-first programming. I started teaching this at Florida Tech as part of our Testing 2 course, on programmer testing, with two colleagues/students (Andy Tinkham and Pat McGee). Part of what I’ve been learning from that course is how to teach programmers how to test their own code. But the breakthrough for me was that teaching test-first programming is a great way to teach new programmers how to program. So, I’ve started teaching first year students how to program in Java, requiring them to use Eclipse and JUnit.
    • I’ve taught this course once so far, had some good experiences with students who had failed a previous programming course.
    • I don’t think test-first programming is for everyone–I think there are strong cognitive style differences among people and that test-first probably works best for folks who learn better by doing than by reading or listening and for people who think more effectively when they build from the details up than from an abstract model down. But that includes a lot of people, and universities squander (flunk out or drop out) about 40% of the students who take first-year programming.

I think there are a lot of good, diligent, motivated black-box testers who want to learn how to program and how to apply that to their testing. I think that the next generation of testers will have to have programming skills. Test-first might be the way to help some of our generation develop skills that have eluded them. Can we create useful instructional materials for test-first programming that support self-study online? I don’t know, but I have some ideas that might help.

  • As a final note, it might be time for me to leave Florida Tech. There are a lot of reasons to stay–it’s been a good home so far–but Becky just finished her Ph.D. and her career opportunities in Florida look limited. I would also enjoy working in a community that had more of an IT base than a defense contractor base and in a school that had a stronger emphasis on social science (and thus more opportunity to collaborate with those researchers), even if this meant sacrificing some of the excellence in science and engineering that I’ve enjoyed at Florida Tech.

The core of the 2006-2010 work plan is to help improve the state of the practice in software testing, by building the community and developing an online curriculum in software testing that is accessible everywhere in the world (that probably means free, or cheap) and supports significant growth in skill and insight. It’s an ambitious goal. I won’t fully achieve it, certainly not all of it in 5 years, even with an increasing group of collaborators who have inspired parts of this vision and/or will take parts beyond what I could ever achieve. But I do think we can make a lot of progress against it.

HERE’S SOME BACKGROUND
In 1999, I had to choose an emphasis–law or computing:

  • I define my career in terms of a field, “safety and satisfaction of software customers and developers.” Legal, psychological, technological and commercial considerations come up in almost every issue in that field, so I end up doing work in all of those other areas.
    • To develop competence as a lawyer (while managing 40 people during the day), I had to reallocate my discretionary time from programming to law. Then I graduated, got caught up in the development of the Uniform Computer Information Transactions Act, the Uniform Electronic Transactions Act (which became the federal law, E-SIGN), and an update to Article 2 (law of sales) of the Uniform Commercial Code, and wrote Bad Software (by far my best, but least commercially successful, book).
    • To support the legislative work, I shifted my business model from work as a software development consultant and as an attorney who specializes in computer-related commercial law, to include a lot more commercial training. The training brought in far more income and more client appreciation than anything else, but it also exposed some interesting challenges that I hadn’t considered. I’ll come back to that.
  • The legal work was satisfying, I had an impact–even got elected to the American Law Institute–but it didn’t pay the bills. I had to either shift full-time into law or return focus to the technical work. My technical training and consulting paid well, but there were technologies that I had to catch up on, and problems that obviously belonged in my area if I wanted to claim to be continuing in a serious way in the field. They needed a full-time commitment too.

Ultimately, I decided that I could do more good, and have more fun, focusing on software than law. As a lawyer, in the current polical climate, my work would be defined more by the bad proposals I would fight against (some have become law; maybe I could have had a small impact on some of their details). As a technologist, I could wrestle with some complex educational problems and maybe develop a new model that could benefit the field.

The problem is that we need a new generation of software testing architects and we don’t know how to foster them.

Most of the widely used testing techniques, much of the current “tester attitude” were widespread back in 1983, when I got my first job in testing. Sadly, most of today’s commercial training in testing — including the industry standards and the syllabi for certification of software testers–are also stuck in 1983 (or maybe the 1970’s, when the key techniques and attitudes really developed).

Unfortunately, the technological and societal contexts have changed. In the 1970’s, when I was learning software engineering, a large computer program was 10,000 lines of code. Business code was often written in COBOL, a language designed to be readable even to business managers. A tester who really wanted to could read ALL of the relevant parts of the program, identify ALL of the relevant variables, and probably trace most or all of the information flows. These days, even your cell phone probably has over a million lines of code (several have over 5 million).

It’s great to lovingly handcraft every test, but in the face of programs this large, the tester gets (proportionally) so little done that even if the technological risks were the same now as before (different risks call for different types of tests), the tester who does things the old way will find her work irrelevant simply because she can’t do enough in the time available to make a difference to the quality of this large a program.

We need new models that improve our productivity by several orders of magnitude.

Sadly, the most popular test automation tools perpetuate the old styles of design and test and improve productivity only marginally. People spend big bucks for the small improvements in productivity and project control, and these improvements are often not realized with GUI regression automation. Some of the tools–test case managers that want full and independent descriptions of each case–probably make things worse.
To break out of the rut, we need to think in new ways, to use the technology in new ways. For that, we need advances in theory and practice that go beyond what testers can easily learn and reinvent on their own (or with a few friends) on the job.

Traditional commercial training provides no (0.00) help with this. Commercial short courses are fine vehicles for exposing people to new ideas, but not for developing new skills or exploring new insights in depth. People need time to play with new ideas, challenge them, try them out, learn how to apply them, and for some of the brightest, learn how to go beyond them to the next ideas on the new path. There is no time for this in a three-day or five-day course.

Universities provide a much stronger environment for this deeper and transformative learning.

Unfortunately, universities don’t provide much education in software testing–certainly not enough to foster a paradigm shift.
In 1999, I decided to find a university that would welcome efforts on software testing education that were focused on having an impact of the practice of software testing in the field. That was the core of the 5-year plan. In 2000, Florida Tech welcomed me, I moved to Melbourne, and have had six fascinating years learning (and help others learn) a new set of skills.

Software Customer Bill of Rights

August 27th, 2003

As the software infrastructure has been going through chaos, reporters (and others) have been called me several times to ask what our legal rights are now and whether we should all be able to sue Microsoft (or other vendors who ship defective software or software that fails in normal use).

Unfortunately, software customer rights have eroded dramatically over the last ten years. Ten years ago, the United States Court of Appeals for the Third Circuit flatly rejected a software publisher’s attempts to enforce contract terms that it didn’t make available to the customer until after the customer ordered the software, paid for it, and took delivery. Citing sections of Uniform Commercial Code’s Article 2 (Law of Sales) that every law student works through in tedious detail in their contracts class, the Court said that the contract for sale is formed when the customer agrees to pay and the seller agrees to deliver the product. Terms presented later are proposals for modification to the contract. The customer has the right to keep the product and use it under the original terms, and refuse to accept the new, seller favorable terms. Other courts (such as the United States Court of Appeals for the First Circuit) cited this case as representative of the mainstream interpretation of Article 2. Under this decision, and several decisions before it, shrinkwrapped contracts and clickwrapped contracts (the ones you have to click “OK” to in order to install the product) would be largely unenforceable.

The software publishing community started aggressively trying to rewrite contract law in about 1988, after the United States Court of Appeal for the Fifth Circuit rejected a shrinkwrapped restriction on reverse engineering. That effort resulted in the Uniform Computer Information Transactions Act and a string of court decisions, starting in 1995, that make it almost impossible to hold a software company liable for defects in its product (unless the defect results in injury or death)– even defects that it knew about when it shipped the product — and also very difficult to hold a mass-market seller liable for false claims about its product. (For background, see InfoWorld and Kaner’s Software Engineering & UCITA in the section on Forcing Products Liability Suits into Arbitration).

So what should we do about this? There are some strong feelings to hold companies fully accountable for losses caused by their products’ defects.

I’d rather stand back from the current crisis, consider the legal debates over the last 10 years, and make some modest suggestions that could go a long way toward restoring integrity and trust — and consumer confidence, consumer excitement, and sales — in this stalled marketplace.

1. Let the customer see the contract before the sale. It should be easy for customers of mass-market software products and computer information contracts to compare the contract terms for a product, or for competing products, before they download, use, or pay for a product. (NOTE: This is not a radical principle. American buyers of all types of consumer products that cost more than $15 are entitled to see the contract (at a minimum, the warranties in the contract) before the sale).

2. Disclose known defects. The software company or service provider must disclose the defects that it knows about to potential customers, in a way that is likely to be understood by a typical member of the market for that product or service.

3. The product (or information service) must live up to the manufacturer’s and seller’s claims. A statement by the vendor (manufacturer or seller) about the product that is intended to describe the product to potential customers is a warranty, a promise that the product will work as described. Warranties by sellers are defined in UCC Article 2 Section 313. Manufacturer liability is clarified (manufacturers are liable for claims they make in ads and in the manual) in a set of clarifying amendments to Article 2 that have now been approved by the Permanent Editorial Board for the UCC, which will be probably introduced in state legislatures starting early in 2004. In addition, it is a deceptive trade practice in most states (perhaps all) to make claims about the product that are incorrect and make the product more attractive. For example, under the Uniform Deceptive Trade Practices Act, Section 2(5) it is unlawfully deceptive to represent “that goods or services have sponsorship, approval, characteristics, ingredients, uses, benefits, or quantities that they do not have.” UCITA was designed to pull software out of the scope of laws like this, which it did by defining software transactions as neither goods nor services but licenses. We should get rid of this cleverly created ambiguity.

4. User has right to see and approve all transfers of information from her computer. Before an application transmits any data from the user’s computer, the user should have the ability to see what’s being sent. If the message is encrypted, the user should be shown an unencrypted version. On seeing the message, the user should be able to refuse to send it. This may cause the application to cancel a transaction (such as a sale that depends on transmission of a valid credit card number), but transmission of data from the user’s machine without the user’s knowledge or in spite of the user’s refusal should be prosecutable as computer tampering.

5. A software vendor may not block customer from accessing his own data without court approval.

6. A software vendor may not prematurely terminate a license without court approval. The issue of vendor self-help (early termination of a software contract without a supporting court order) was debated at great length through the UCITA process. To turn off a customer’s access to software that runs on the customer’s machine, the vendor should get an injunction (a court order). However, perhaps a vendor should be able to deny a customer access to software running on the vendor’s machine without getting an injunction (though the unfairly-terminated customer should be allowed to get a court order to restore its access.)

7. Mass-market customers may criticize products, publish benchmark study results, and make fair use of a product. Some software licenses bar the customer from publishing criticisms of the product, or publishing comparisons of this product with others or using screenshots or product graphics to satirize or disparage the product or the company. Under the Copyright Act, you are allowed to reproduce part of a copyrighted work in order to criticize it, comment on it, teach from it, and so on. Software publishers shouldn’t be able to use “license” contracts to bar their mass-market customers from the type of free speech that the Federal laws (including the Copyright Act) have consistently protected.

8. The user may reverse engineer the software. Software licenses routinely ban reverse engineering, but American courts routinely say that reverse engineering is fair use, permissible under the Copyright Act. Recently, California courts have started enforcing no-reverse-engineering bans in software licenses. This is a big problem. Software publishers claim that reverse engineering is a way to steal their work. There are many legitimate, important uses of reverse engineering, such as exposing security holes in the software, exposing and fixing bugs (that the manufacturer might not fix because it is unwilling, unable, or no longer in business), exposing copyright violations or fraudulent claims by the manufacturer, or achieving interoperability (making the product work with another product or device). These benefit or protect the customer but do not help anyone unfairly compete with the manufacturer.

9. Mass-market software should be transferrable. Under the First Sale Doctrine, someone who buys a copyrighted product (like a book) can lend it, sell it, or give it away without having to get permission of the original publisher or author. Similarly, if you buy a car, you don’t have to get the car manufacturer’s permission to lend, sell, or donate your car. UCITA Section 503(2)allows mass-market software publishers to take away their customers’ rights to transfer software that they’ve paid for. It should not.

10. When software is embedded in a product, the law governing the product should govern the software. Think of the software that controls the fuel injectors in a car. Should the car manufacturer be allowed to license this software instead of supplying it under the basic contract for the sale of the car? (Paper 1) (Paper 2). Under extended pressure from the software industry, the Article 2 amendments specify that software (information) is not “goods” and so is not within the scope of Article 2, even though courts have been consistently applying Article 2 to packaged software transactions since 1970. In the 48 states that have not adopted UCITA, this amendment would mean that there is no law in that state that governs transactions in software. The courts would have to reason by analogy, either to UCITA or to UCC 2 or to something else. When a product includes both hardware (the car) and software (the fuel injector software, braking software, etc.), amended Article 2 allows the court to apply Article 2 to the hardware and other law to the software. Thus different warranty rules could apply and even though you could sell your car used without paying a fee to the manufacturer, you might not be able to transfer the car’s software without paying that fee. Vendors should not be able to play these kinds of games. “Embedded software” is itself a highly ambiguous term. In those cases in which it is unclear whether software is embedded or not, the law should treate the software as embedded.

SWEBOK Problems, Part 2

June 27th, 2003

I’m going through my detailed review of SWEBOK, in preparation for the June 30 comment deadline. The bulk of this blog entry is a page-by-page commentary / critique that I will submit to the SWEBOK review. Before that, here are some contextual comments.

Please get involved in this review process, which will close on June 30. Go to www.swebok.org to sign up and download swebok, and submit comments.

Time is short, and you might not be able to read all of SWEBOK in time to submit detailed comments. That’s OK. I recommend that you download it, skim the parts that are most interesting, realize the extent to which it excludes modern methods (such as agile development) and, if this bothers you, you can submit a very simple comment.

You can say something like:

“I have reviewed SWEBOK. I manage software development staff
and play a role in their training and supervision. SWEBOK does not
provide a good basis for the structure or detail of the knowledge
that I want my staff to have. It emphasizes attitudes and practices
that are not helpful on my projects and it downplays or skips
attitudes and practices that I consider essential. I consider this
document fundamentally flawed, and if I could vote to disapprove it,
I would.”

Obviously, you would tailor this to your circumstances.

=======

Overall Concerns with SWEBOK

=======

SWEBOK was created using a strange process. They started with the table of contents of the main software engineering textbooks — as if there is a strong relationship between software engineering as described in textbooks and software engineering as practiced in the field. From there, SWEBOK developed as delta’s from these books. SWEBOK is focused on “established traditional practices recommended by many organizations” and is intended to exclude “practices used only for certain types of software” and to exclude “innovative practices tested and used by some organizations and concepts still being developed and testing in research organizations.”

Somehow, we conclude that mutation testing is an established traditional practice that is widely recommended and used, but we exclude scenario testing. We conclude that massive tombs of test documentation are an established traditional practice widely followed, even though rants about bad test documentation are, to say the least, a common theme of comment in the community. And we exclude consideration of requirements analytical techniques (or project context considerations) that might help you make a sensible engineering determination of what types of documentation, at what level of depth, for what target reader, are worth the expense of creating and (possibly) maintaining them.

In the SWEBOK, page IX, we learn that the purpose of SWEBOK is to provide a “consensually-validated characterization.” In this, SWEBOK has failed utterly. Only a few people (about 500) were involved in the project. It alienated leading people, such as Grady Booch who recently said (in a post to the extremeprogramming listserv on yahoogroups, dated 5/31/2003)

“I was one of those 500 earlier reviewers – and
my comments were entirely negative. The SWEBOK
I reviewed was well-intentioned but misguided,
naive, incoherent, and just flat wrong in so
many dimensions.”

The Association for Computing Machinery was a co-authoring, co-sponsoring organization of SWEBOK at one point. But ACM eventually commissioned task forces to study the document and the rationale underlying the effort, and the result was a deeply critical evaluation and ACM’s withdrawal from the project.

ACM is the largest association of computing professionals in the world. How can it be said, with a straight face, that SWEBOK is a consensually-validated document when the ACM, including leaders of the ACM Special Interest Group in Software Engineering, determine that the approach to creating the document and the result are fundamentally flawed? See http://www.acm.org/serving/se_policy/ for details.

The SWEBOK response (front page of www.swebok.org) was this:

“The following motion was unanimously adopted
on April 18 2001.

“The Industrial Advisory Board of the Guide
to the Software Engineering Body of Knowledge
(SWEBOK) project recognizes that due process
was followed in the development of the Guide
(Trial Version) and endorses the position that
the Guide (Trial Version) is ready for field
trials for a period of two years.”

I love this phrasing. “Due process” has a fine, legalistic, officious ring to it. It sounds good, and (speaking as an attorney who has experience using lawyerly terms like “due process”) it will intimidate or silence some critics. But if your acceptance criterion is consensus, and you have obviously failed to achieve consensus, then a term like “due process” is just so much smoke to confuse the issue. If the process fails to produce the required product, the fact (if it is a fact) that the process was followed doesn’t make the failure a non-failure.

==========

Detailed Evaluation Comments

==========

Here are my page-by-page comments on the testing section of SWEBOK. I have reviewed other parts of SWEBOK and have concerns about them too, but life is short and precious and there is only so much of mine that I am willing to dedicate to a criticism of a fundamentally flawed piece of work.

===========

Page 69. The document praises the role of testing as a preventative technique throughout the lifecycle, but doesn’t consider test-driven development, which I believe is the single most important type of early testing.

============

Page 69. The document defines software testing as follows: “Software testing consists of the dynamic verification of the behavior of a program on a finite series of test cases, suitably selected from the usually infinite executions [sic] domain, against the specified expected behavior.”

In fact, a great deal of testing is done without specifying expected behavior. Here are three examples:

(1) Exploratory testing is done partially to discover the behavior.

(2) Some types of high volume random testing check for indicators of failure without having any model of expected behavior. (It would be ludicrous to say that their model of the expected behavior is that the program will not have memory leaks, stack corruption or other specific defects.)

(3) Most forms of user testing fail to involve comparison to specified behavior, and the user who protests that a certain behavior in a certain context is inappropriate, confusing or unacceptable, might well not be able to articulate her expectations, even after the failure, let along specify them in advance. (In many cases, expectation is driven by similarity to other experiences and we know from research in cognitive psychology, e.g. from Lee Brooks’ lab at McMaster, that many people would be unable to describe the similarity space that is the basis for their judgments.)

These types of test are widely used by testers, and they have been widely used for decades. Good testing sometimes involves comparison to specified expected behavior, but it often does not.

=============

Page 70. The document provides a laundry list of test techniques with no obvious selection or exclusion principle.

One of the oddities on page 70 is the assertion that “branch coverage is a popular test technique.” HUH? What makes this a technique? You achieve branch coverage by running any group of tests that take the program through each branch. We could achieve this test objective (achieve a certain level of coverage) via scenario tests, domain tests, various other types of tests. We could achieve the objective by running tests at the unit level or the fully integrated system level. SWEBOK says that coverage “should not be considered _per_se_ as the objective of testing.” I share that opinion — it is a poor objective. But it appears to be the objective of many people who drive their testing in order to achieve this result. The fact that the authors of SWEBOK don’t like coverage as an objective doesn’t make it a technique.

Another strange page 70 assertion is that test techniques used primarily to expose failures are primarily domain testing. SWEBOK says, “These techniques variously attempt to “breakâ€? the program, by running one [or more] test[s] drawn from identified classes of (deemed equivalent) executions. The leading principle underlying such techniques is being as much systematic as possible in identifying a representative set of program behaviors (generally in the form of subclasses of the input domain).”

Yes, domain testing is the most commonly described technique in textbooks. It is simple, easy to understand, and easy to teach. But risk-based testing, scenario testing, stress testing, specification-focused testing, high-volume automated testing, state-model-based testing, transaction-flow testing, and heuristic-based exploratory testing are other examples of testing techniques that go after bugs in the product. Why ignore these in favor of domain testing?

Additionally, even though the textbooks most often talk in terms of subclasses of input domains, it is important and fruitful to also analyze the program in terms of its output domains, its interfaces with other devices (disk, printer, etc.) and other processes, and its internal intermediate-result variables. By focusing students (or worse, professionals) on input domains to the exclusion of the others, we virtually blind them to important problems. As the ACM pointed out in its evaluation of SWEBOK, a “body of knowledge” should be focused on competent practice, not on the descriptions in introductory books.

SWEBOK (p. 70) also tells us that to avoid confusing test objectives and techniques, we must clearly distinguish between measures of the thoroughness of testing and measures of the software under test (such as measures of reliability). SWEBOK also tells us that when we conduct testing “in view of a specific purpose”, then that specific purpose is the “test objective.” SWEBOK lists examples of reliability measurement, usability evaluation, and contractor’s acceptance as important examples of objectives. I think those are fine objectives. But if a regulatory requirement specifies that I must achieve a certain type of coverage, and I design tests to meet that requirement, then meeting that coverage target IS my specific purpose for those tests. I can think of several circumstances under which achievement of a level of thoroughness of a certain type of testing IS the specific purpose for running a set of tests. What principled basis does SWEBOK have in (apparently) rejecting these as invalid objectives?

One (failing) rationale for deciding that achieving a certain level of (some type of) coverage is not a valid objective is that we strive to achieve coverage in order to help achieve something else, such as reliability. That sounds good (in spite of the fact that in some situations, we strive to achieve a certain level of coverage primarily in order to be able to say we achieved that level of coverage), but the reasoning generalizes inconveniently. For example, in many organizations, we do usability testing in order to help achieve customer acceptance. So usability evaluation should not be a valid test objective (because in some contexts, coverage is to reliability as, in other contexts, usability is to acceptance). But SWEBOK specifically blesses usability evaluation and contractor (customer) acceptance as valid test objectives.

A test objective is the objective that drives the design and execution of the tests. Different objectives are appropriate in different contexts. SWEBOK has no business dismissing some objectives as non-objectives.

=================

SWEBOK page 70 states that “Software testing is a very expensive and labor-intensive part of development. For this reason, tools are instrumental for automated test execution, test results logging and evaluation, and in general to support test activities. Moreover, in order to enhance cost-effectiveness ratio, a key issue has always been pushing test automation as much as possible.”

The idea that we should be “pushing test automation as much as possible” has been a source of much mischief and misunderstanding. I frequently hear from experienced testers that their highest bug find rates are achieved using manual or computer-assisted one-time-use tests. I don’t believe that it is to our advantage to stop doing this type of testing. Instead, I think we should be “pushing” cost-benefit analysis and implementing automation when it is cost-effective. For additional discussion of cost/benefit analysis for automation, see my papers, Architectures of Test Automation (https://kaner.com/testarch.html) and Avoiding Shelfware: A Manager’s View of Automated GUI Testing (https://13j276.p3cdn1.secureserver.net/pdfs/shelfwar.pdf).

The idea that we are actually automating testing is itself a misconception. Let’s consider the most common form of test “automation”, GUI regression-level “automation”. It involves these tasks

TASK / DONE BY

Analyze the specification and other docs for ambiguity or other indicators of potential error

–> Done by humans

Analyze the source code for potential errors or other things to test

–> Done by humans

Design test cases

–> Done by humans

Create test data

–> Done by humans

Run the tests the first time

–> Done by humans

Evaluate the first result

–> Done by humans

Report a bug from the first run

–> Done by humans

Debug the tests

–> Done by humans

Save the code

–> Done by humans

Save the results

–> Done by humans

Document the tests

–> Done by humans

Build a traceability matrix (tracing test cases back to specs or requirements)

–> Done by humans or by another tool (not the GUI tool)

Select the test cases to be run

–> Done by humans

Run the tests

–> The Tool does it

Record the results

–> The Tool does it

Evaluate the results

–>The Tool does it, but if there’s an apparent failure, a human re-evaluates the results.

Measure the results (e.g. performance measures)

–> Done by humans or by another tool (not the GUI tool)

Report errors

–> Done by humans

Update and debug the tests

–> Done by humans

When we see how many of the testing-related tasks are being done by people or, perhaps, by other testing tools, we realize that the GUI-level regression test tool doesn’t really automate testing. It just helps a human to do the testing. Rather than calling this “automated testing”, we should call it computer-assisted testing. I am not showing disrespect for this approach by calling it computer-assisted testing. Instead, I’m making a point–there are a lot of tasks in a testing project and we can get help from a hardware or software tool to handle any subset of them. GUI regression test tools handle some of these tasks very well. Other tools or approaches will handle a different subset

We should use tools in software testing, but we should not strive for complete automation. It is the wrong goal.

======================

Page 71-73 diagrams

These pages provide some diagrams of the structure of the rest of the testing chapter. Several of the items on these pages are troubling, but I’ll refer to them in the context of the more detailed discussions in the rest of the chapter.

=====================

Page 74, definitions of fault, failure and defect.

I don’t disagree with the definitions of fault and failure. However, SWEBOK equates “fault” and “defect”, where “fault” refers to the underlying cause of a malfunction.

I have two objections to the use of the word defect.

(a) First, in use, the word “defect” is ambiguous. For example, as a matter of law, a product is dangerously defective if it behaves in a way that would be unexpected by a reasonable user and that behavior results in injury. This is a failure-level definition of “defect.” Rather than trying to impose precision on a term that is going to remain ambiguous despite IEEE’s best efforts, our technical language should allow for the ambiguity.

(b) Second, the use of the word “defect” has legal implications. While some people advocate that we should use the word “defect” to refer to “bugs”, a bug-tracking database that contains frequent assertions of the form “X is a defect” may severely and unnecessarily damage the defendant software developer/publisher in court. In a suit based on an allegation that a product is defective (such as a breach of warranty suit, or a personal injury suit), the plaintiff must prove that the product is defective. If a problem with the program is labeled “defect” in the bug tracking system, that label is likely to convince the jury that the bug is a defect, even if a more thorough legal analysis would not result in classification of that particular problem as “defect” in the meaning of the legal system.

We should be cautious in the use of the word “defect”, recognize that this word will be interpreted in multiple ways by technical and nontechnical people, and recognize that a company’s use of the word in its engineering documents might unreasonably expose that company to legal liability.

=====================

Page 75, The Oracle Problem

As Doug Hoffman has pointed out, oracles are heuristic. When we use an oracle to determine that a program has passed or failed a test, we are comparing the program to some model or expectation on some number of dimensions. The program can fail on other dimensions that the oracle is blind to. For example, if we use Excel as the oracle for a spreadsheet under development, and evaluate the formula A1+A2, we might set cell A1 to 2 and cell A2 to 3 in both program and get 5 in both cases. In terms of the oracle, our spreadsheet has passed this test. But suppose the new spreadsheet took 5 hours to evaluate A1+A2. This is unacceptable, but the oracle is oblivious to it.

The characterization of oracles as tools to decide whether a program behaved correctly on a given test, without discussion of the inherent fallibility of all oracles, has led to serious misunderstandings.

=======================

Page 75, Testability

The third common meaning of testability in practice refers to the extend to which the program is easy to test and the test results are easy to interpret. Thus a highly testable program provides a high level of _control_ (the tester might be able to change data, start the program at any point, etc.) and a high level of _visibility_ (the tester can determine the state of the program, the value of specific variables, etc.)

This is widely used and it guides negotiation among testers and programmers regarding the support for testing that will be designed into a program.

=======================

Page 75, Test Levels (Unit, Integration, System)

SWEBOK says “Clearly, unit testing starts after coding is quite mature, for instance after a clean compile.”

This is 100% in disagreement with the practice of test-driven development, which requires the programmer to write a unit test immediately _before_ writing code that will enable the program to pass the test.

I think that test-driven development is the most important advance in the craft of testing of the past 30 years. This, more than any of the other flaws, illustrates the extent to which SWEBOK is blind to modern good practice.

========================

Page 75, Test Levels (Unit, Integration, System)

Much testing now involves API-level driving of a component developed by someone else. I think this is neither unit, nor integration, nor system testing.

========================

Page 75, Test Levels (Unit, Integration, System)

In test-driven development, the programmer implements a test and then writes the code needed for the program to pass the test. (More precisely, implement the test, run the program and see how it fails, write the simplest code that can pass the test, run the program and fix until the program passes the test, then refactor the code and retest.

The first use of these tests is to guide and check the initial implementation of a few lines of code. In that sense, they are “unit” tests and they are often referred to as unit tests. However, as programming tools evolve, these tests often look at the lines of code in the context of several other features. Using tools like Ward Cunningham’s FIT, for example, programmers might create many “unit” tests that are also “integration” (multi-variable, multi-function) and “system” (check whether an intended benefit will actually be provided to the end user).

I don’t think we should ban the use of the terms “unit”, “integration” and “system.” However, thinking about these as THE THREE LEVELS of testing, as defining the 3 targets of testing, leads to blind spots with respect to the nature of targets and the potential focus of individual tests.

========================

Page 75-77, Objectives of Testing

I think the categorization of testing concepts is strikingly odd. Here, I note the oddness of the list of testing objectives.

SWEBOK lists
– conformance testing, which it equates to functional testing
– reliability testing
– usability testing
– acceptance / qualification testing
– installation testing
– alpha and beta testing
– regression testing
– performance testing
– stress testing
– back-to-back testing
– recovery testing
– configuration testing, and
– usability testing.

This seems like a laundry list. Back-to-back testing looks more like a technique than an objective. Several others of these could be classed in different ways.

More important, what determines inclusion on this list?

For example, I think of objectives of testing as including:
– minimize liability risk
– decision support (help a project manager determine whether
to release the product)
– compliance with regulations or the expectations of a
regulatory inspector (this may or may not involve
conformance with a specification)
– assess and improve safety
– determine the nature of problems likely to arise in long
use of the product
– expose defects
– block premature release of a product
– improve the user experience

and many others.

In looking at the various lists in this document, I cannot divine a principle that governs inclusion versus exclusion.

As a teacher, I think that many things off the list are more important than the things on the list.

===================

Page 76. Regression testing

SWEBOK defines regression as

“the selective retesting of a system or component to verify that modifications have not caused unintended effects. In practice, the idea is to show that previously passed tests still do.” It then refers to “the assurance given by regression testing” [. . .] “that the software’s behavior is unchanged.”

An earlier version of SWEBOK noted that “regression testing” is commonly used to mean retesting the program to determine whether a bug was fixed. This is a popular definition. Why is it excluded?

Another common definition of regression testing is retesting the program to determine whether changes have caused fixed bugs to be re-broken.

If SWEBOK is a description of what is generally known and done, it should not select one definition and objective and exclude other common ones without even mentioning them.

Next, consider the idea that we run a bunch of tests again and again in order to assure ourselves that software behavior is unchanged. The regression test suite is a relatively tiny collection of tests that can only look at a relatively small proportion of the system’s behaviors. Our gamble is that the software’s behavior is not changing in ways missed by the regression tests. I have never seen a convincing theoretical argument that a regression test suite will expose most or all possible behavioral changes of a program and therefore I reject the notion that regression testing provides “assurance.”

The initial definition of regression testing is quite different from the idea of “assuring that system behavior is unchanged.” The definition is “verify that modifications have not caused unintended effects.” Let’s restate this definition in less antiquated terms — let’s talk about RISK instead of VERIFICATION.

That yields the idea that regression testing is done to mitigate the risk of unintended side effects of change.

A risk-based view of regression testing no longer requires us to use the same test, time and again, to study aspects of the program that have been previously tested. You can change data or combine a test with other tests or do other creative things to search for side effects. By varying the tests, you give yourself the chance to find previously-missed bugs — problems that were in the software all along but that you have missed with your tests so far — along with catching some side-effects. You are increasing coverage instead of mindlessly repeating the same old thing.

And under the risk-based view, you don’t fool yourself or defraud others with the idea that a small set of tests verifies some quality characteristic of the product. On page 75, SWEBOK approvingly noted Dijkstra’s insight that you can show the presence of bugs, but not their absence. This insight is flatly inconsistent with claims that we do any type of testing to “assure” or “verify”.

This conceptual contradiction illustrates the extent to which SWEBOK (as reflected in the testing section) seems to be more like a dumpster of testing concepts than like a conceptually coherent presentation.

As a dumpster (a disorganized collection of a miscellany of concepts) it is problematic because of the number and nature of things that have been kept out of the dumpster.

(NOTE: The analysis of regression testing above is mainly an analysis of system-level regression testing. I very much like the idea of creating an extensive suite of unit-level change-detectors, tests that we mainly create test-first, that cover every line and branch that we write. The difference between unit-level regression tests and system level is cost. The programmer runs the change-detector suite, every time she recompiles the code. If her change breaks the build, she fixes it immediately.

The labor cost associated with an independent tester discovering a regression error (which might be a bug in the program but is very often a test announcing that it must be changed to conform to a revised design) is quite high. Counting all people involved in the process, the time from failure through bug reporting to bug evaluation, prioritization, repair, and retesting will often total to an average of 4 labor-hours and or even higher (much higher) in some organizations.

In contrast, with the unit level change-detector suite, the programmer discovers the problem when she compiles the code, and immediately either fixes the code or the test. The labor cost is minimal. The practice is cheap enough that we can use it to support refactoring. The cost associated with traditional system-level regression testing is so high that we could not use it to support refactoring. The high communcation cost drives the cost of late change through the roof (one of the factors of the exponential growth curve of the cost of change over project time) whereas the absence of communication cost associated the unit-level change detectors allows us to make late changes at relatively low cost.

This is an important distinction within the description of a technique that SWEBOK says can be run at the system level or the unit level. We can do regression testing at either level, but their costs, benefits and uses are entirely different. It’s too bad that SWEBOK misses this point.

====================

Page 77 Test Techniques

This is another laundry list that excludes important current techniques, includes techniques that seem to be not widely used, and doesn’t expose any principled basis for inclusion or exclusion.

====================

Page 77 “Ad hoc testing”

SWEBOK says this (which it equates to exploratory testing) is the most widely practiced technique, and then advises a more systematic approach and then says that only experts should do ad hoc testing.

This is a blatant admission that the SWEBOK drafters simply don’t understand the most widely practiced approach to testing. SWEBOK cites my book as its source for testing “based on tester’s intuition and experience” and I believe I am the person who coined the term, “exploratory testing”, so let’s look at what this is. (Much of this material was developed by or with James Bach and many other colleagues over a 20 year period.)

First, exploratory testing involves simultaneous design, execution, and learning about the program. Rather than design tests and then run them, you do some testing, learn from them, learn from other sources, and base your design of next tests on your new insights. Your oracle (set of evaluation criteria) evolve as you learn more.

Second, every competent tester does exploratory testing. If you report a bug and the programmer tells you it was fixed, you do some testing around the fix. One test is the test that exposed the bug in the first place. But if you’re any good, you create additional tests to see if the fix is more general than the specific circumstances reported in the bug report, and to see if there were side effects. These are not pre-planned, pre-specified tests. They are designed, run, evaluated and extended in the moment. I use this testing situation as a basic training ground for junior testers (and classroom students). Surely, it is not something we would leave only to the experts. But SWEBOK tells us that “A more systematic approach is advised” and “ad hoc testing might be useful but only if the tester is really expert!”

There is such a thing as systematic exploratory testing.

I’m not trying to write SWEBOK here, but rather to support my assertion that SWEBOK seem to be clueless about a body of work that even it describes as the most widely practiced technique in the field.

If SWEBOK is intended to describe the current body of knowledge and practice in the field, its cluelessness about the most widely practiced approach is inexcusable.

===============

Page 80

My final comment on SWEBOK’s testing section has to do with its comments on test documentation.

“Documentation is an integral part of the formalization of the test process. The IEEE standard for Software Test Documentation [829] provides a good description of test documents and of their relationship with one another and with the testing process. . . . Test documentation should be produced and continually updated, at the same standards as other types of documentation in development.”

This is pious-sounding claptrap, religious doctrine rather than engineering. There is far too much of this in SWEBOK.

Is IEEE Standard 829 a good description?

In my experience, I have seen 829 applied by many commercial software companies or commercial companies that were developing software as part of their support process for their main business. I have never seen a case in which a commercial software application was benefitted more than it was harmed by application of standard 829. Several colleagues of mine have had the same experience. Bach, Pettichord and I discuss the problems in Lessons Learned in Software Testing.

Normally, an engineering body of knowledge includes assertions that are based on theory and tested by experiment. There was no theoretical basis underlying Standard 829. I am not aware of any experimental research of the costs and benefits associated with the application of 829.

Test document is expensive.

Testing is subject to a very difficult constraint. We have an infinity of potential tests and a very limited amount of time in which to imagine, create, document, and evaluate the results of running a few of those many possible tests.

Time spent generating paperwork is time not available for test implementation, test tool development, test execution and evaluation.

Good practice, therefore, probably pushes toward cost-benefit evaluation on a case by case basis. If a certain type of document is so valuable for the current project that it is worth taking time away from competing tasks in order to create the document, create the document.

Before we can pronounce that test documentation should be continually updated, we should discover why the test documentation is being created and how it will be used in the future. Maybe updating is called for. Maybe not. Maybe the documentation should be up to the standards of other documentation on the project, but maybe not. It depends on who will use the documents, and for what purpose.

=====================

In Sum

My time is limited.

I could write pages and pages more about the weaknesses of SWEBOK, but I think it would be pointless.

I agree with the ACM appraisal that the SWEBOK started with a fundamentally flawed approach. The result continues to be fundamentally flawed.

The call for comments on SWEBOK asked for appraisal of SWEBOK as it relates to teaching.

I teach courses in software testing. SWEBOK is not a good reference point for them.

SWEBOK’s criteria for inclusion and exclusion of topics is unsatisfactory. Many of the most important topics in my testing courses, (such as test driven development, API-level testing, scenario testing, skilled exploration, the difference in objectives and cost/benefit for unit-level regression suites and system-level regression suites, risk-based testing, a risk-based approach to domain testing instead of the stale, 40-year-old, boundary/equivalence approach documented in SWEBOK), effective bug reporting, using requirements analysis techniques to drive decisions about the types of artifacts to be generated, and on and on, are absent from SWEBOK. Much of what is present in SWEBOK is organized strangely, is dated, and many of the techniques (etc.) are marginal in terms of how often they are used and what value they actually provide.

I have appraised SWEBOK against my course notes, which I update regularly (and which I am updating again this summer). My conclusion was that SWEBOK’s flaws are so severe that, on balance, it is a less-than-worthless reference point for discovery of opportunities to improve the notes.

I also teach courses in software metrics.

Measurement, as studied in other fields, normally involves extensive study of validity of measures and threats to validity. One of the most important validity questions is how can we tell whether the measure actually measures what it purports to measure. What model or theory (and associated empirical support) relates the number we obtain (a complexity level of “10”) to the underlying attribute we are trying to measure?

Another critical question involves side effects of measurement. Robert Austin’s book, Measuring and Managing Performance in Organizations, discusses this in detail.

The review of measurement theory in SWEBOK (page 174 and on) skips lightly past these issues and provides a laundry list of metrics, including many that are invalid and unvalidated “metrics” — to the extent that it is clear what attribute they are actually intended to measure. The main value of the SWEBOK treatment of measurement is that it is concise. It makes an excellent “straw man”, something I can hand out and enthusiastically criticize. This is probably not the educational use we would hope to obtain from something that is SUPPOSED TO serve as the basis for a licensing exam.

=============

CLOSING ASSERTION

THERE IS NO BALLOTING PROCESS FOR SWEBOK THIS TIME. IF THERE WAS, I WOULD VOTE THAT SWEBOK SHOULD NOT BE ACCEPTED, WITH OR WITHOUT MODIFICATION.

— Cem Kaner
— Professor of Software Engineering
— Florida Institute of Technology

IEEE’s “Body of Knowledge” for Software Engineering

June 17th, 2003

SOFTWARE ENGINEERING’S “BODY OF KNOWLEDGE�

The IEEE Computer Society has been developing its own statement of the Software Engineering Body of Knowledge (SWEBOK). They are now calling for a review of SWEBOK, which you can participate in at www.swebok.org.
According to their Call for Reviewers (email, May 29, 2003:

“The purpose of the Guide is to characterize the contents of the software engineering discipline, to promote a consistent view of software engineering worldwide, to clarify the place of, and set the boundary of, software engineering with respect to other disciplines, and to provide a foundation for curriculum development and individual licensing material. All deliverables are available without any charge at www.swebok.org.”

SWEBOK pushes the traditional, documentation-heavy approaches. I have read several drafts of it over the years but I chose to not be involved in the official process because I believed that:

  • The document had little merit and probably wouldn’t get much better;
  • My comments wouldn’t have much influence.
  • These grand, in my view highly premature, efforts to standardize and regulate the field come and go but don’t really have enough influence to worry about.

In retrospect, I think that keeping away from SWEBOK was a mistake. I think it has the potential to do substantial harm. I urge you to get involved in the SWEBOK review, make your criticisms clear and explicit, and urge them in writing to abandon this project. Even though this will have little influence with the SWEBOK promoters, it will create a public record of controversy and protest. Because SWEBOK is being effectively pushed as a basis for licensing software engineers and evaluating / accrediting software engineering degree programs, a public record of controversy may play an important role.

LICENSING

Should software engineers be licensed as engineers?
One of the key reasons for the creation of the SWEBOK was to support political moves to license software engineers. This is from the SWEBOK Project Overview

“A core body of knowledge is pivotal to the development and accreditation of university curricula and the licensing and certification of professionals. Achieving consensus by the profession on a core body of knowledge is a key milestone in all disciplines and has been identified by the Coordinating Committee as crucial for the evolution of software engineering toward a professional status. The Guide to the Software Engineering Body of Knowledge project is an initiative completed under the auspices of this Committee to reach this consensus. “

In a series of studies, the Association for Computing Machinery recommended against licensing. I was a member of one of the ACM’s study panels, the one that considered the relationship between licensing and safety-critical software.
I think that licensing engineers in our profession today is premature and likely to do serious harm to the profession. I don’t say this lightly. I’ve thought about it for a long time, and from many perspectives (I am a full Professor of Software Engineering, an Attorney who has a strong interest in malpractice law, and a person who has almost 20 years experience in commercial software development (programming, designing user interfaces, testing, tech writing, managing programmers, testers, and writers, consulting, negotiating contracts, etc.).

SWEBOK

The SWEBOK is written as the basis for licensing exams for professional software engineers. If your state requires you to get a license to practice software engineering (and more will, if they are convinced that they can create fair exams based on a consensus document from the profession), the SWEBOK is the document you will have to study.
If the SWEBOK is the basis for the licensing exam, the practices in the SWEBOK will be treated as the basis for malpractice lawsuits. People who do what is called good practice in SWEBOK will be able to defend their practices in court if they are ever sued for malpractice. People who adopt what might be much better practices, but practices that conflict with the SWEBOK, will risk harsh criticism in court. As the basis for a licensing exam, SWEBOK becomes as close to an Official Statement of the approved practices of the field as a licensed profession is going to get.

So what’s in this SWEBOK?

The IEEE SWEBOK is a statement of “generally accepted practices�, which are defined as “established traditional practices recommended by many organizations.� SWEBOK is NOT a document intended to include “specialized� practices, which are “practices used only for certain types of software� nor for “advanced and research� practices, which are “innovative practices tested and used only by some organizations and concepts still being developed and tested in research organizations.
I am most familiar with SWEBOK’s treatments of software testing, software quality and metrics. It endorses practices that I consider wastefully bureaucratic, document-intensive, tedious, and in commercial software development, not likely to succeed. These are the practices that some software process enthusiasts have tried and tried and tried and tried and tried to ram down the throats of the software development community, with less success than they would like.
By promoting these document-centered, rigid practices in a document that serves as the basis for licensing of software engineers, the SWEBOK committee can drive adoption of these practices to a much greater degree than practitioners have accepted voluntarily.
The Association for Computing Machinery assessed SWEBOK and concluded it was seriously flawed and that ACM (originally a partner in development of the SWEBOK) should withdraw from the process. SWEBOK was ultimately adopted as a “consensus� document based on votes from fewer than 350 reviewers, in the face of criticism and walkout by the largest association of computing professionals in the world.

IN SUM

Only 500 people participated in the development of SWEBOK and many of them voiced deep criticisms of it. The balloted draft was supported by just over 300 people (of a mere 340 voting). Within this group were professional trainers who stand to make substantial income from pre-licensing-exam and pre-certification-exam review courses, consulting/contracting firms who make big profits from contracts that (such as government contracts) that specify gold-plated software development processes (of course you need all this process documentation—the IEEE standards say you need it!), and academics who have never worked on a serious development project. There were also experienced, honest people with no conflicts of interest, but when there are only a few hundred voices, the voices of vested interests can exert a substantial influence on the result.
I don’t see a way to vote on the 2003 version of SWEBOK. If I did, I would urge you to vote NO.
But even though you cannot vote to disapprove this document, you can review it, criticize it, and make clear the extent to which it does fails reflect the better practices in your organization.
To the extent that it is clear that there is no consensus around the SWEBOK, engineering societies will be less likely to rely on it in developing licensing exams (and less likely to push ahead with plans to license software engineers), and judges and juries will be less likely to conclude that “It says so in the SWEBOK. That must be what the best minds in the profession have decided is true.�
Please, go to www.swebok.org ASAP.
Comments at www.swebok.org are welcome until July 1, 2003.

SPAM, Filtering, and Commercial Legislation

May 1st, 2003

The Federal Trade Commission is in the midst of hearings on SPAM. Congressional leaders are promising anti-spam legislation. The biggest headline getter at the moment is Charles Schumer’s promise to introduce a national opt-out registry within a few weeks. Actually, such a bill already exists, sponsored by Mark Dayton along with an added bonus, a federal study of software companies’ technical support practices.

I think we need we a national opt-out list as a first level of defense in dealing with spam because the more obvious first line of defense, the spam filter used by your ISP or installed on your machine, subjects you to important risks. These risks arise from electronic communications rules in the Uniform Electronic Transactions Act (UETA) and the Uniform Computer Information Transactions Act (UCITA) .

The problem is that under these bills, if you someone sends you an important legal notice that looks like spam to your ISP’s filter or the one that runs locally in conjunction with your e-mail client, you won’t see it, but you will still be legally accountable for knowing its contents. There are two significant risks here. First, you might ACCIDENTALLY miss some important mail (like a mortgage foreclosure notice). Second, someone who is required by law to send you a notice but doesn’t want you to read it can INTENTIONALLY distribute it in a way that is likely to trigger your spam filter. In both cases, you lose.
Read the rest of this entry »

Final Exam for Software Testing 2 Class

March 9th, 2003

Final Exam for Software Testing 2 Class

People often ask me about the difference between commercial certification in software testing and university education. I explain that the tester certification exams typically test what people have memorized rather than what they can do.

Several people listen to this politely but don’t understand the distinction that I’m making.

I just finished teaching Software Testing 2 at Florida Tech. This is our first round with the course. Next year’s edition will be tougher. (As we work out the kinks, the course will cover more and better each time we teach it, for at least three more teachings.) Even though the course was relatively easy, I think our course’s final exam illustrates the difference between a certification exam and a university exam.

Comments welcome. Send them to me at weblog@kaner.com

FINAL EXAM–SOFTWARE TESTING 2
April 17 – 25, 2003
Due April 25, 2003. I will accept late exams – without late penalty— up to 5:00 p.m. on May 1. No exams will be accepted after 5 p.m. May 1. It is essential that you work on this exam ALONE with no input or assistance from other people. You MAY NOT discuss your progress or results with other students in the class.

MAIN TASK
Use Ruby to build a high volume automated test tool to check the mathematics in Open Office spreadsheet

Total points available = 100

RUBY PROGRAM DEVELOPMENT — 25 points
1. Your development of the Ruby program should be test-driven. Use testunit (or runit) to test the Ruby program. Show several iterations in the test-driven development.

PICK YOUR FUNCTIONS — 10 points

2. You will test OpenOffice 1.0 by comparing its results to results you get from Microsoft Excel.

2a Choose five mathematical or financial functions that take one or two parameters

2b. Choose five mathematical or financial functions that take many parameters (at least 3)

INDIVIDUAL FUNCTION TESTS — 25 Points

3. Your program should generate random inputs that you will feed as parameter values to the functions that you have selected:
For each function, run 100 tests as follows

* Generate the input(s) for this function. The set you use should be primarily valid, but you should try some invalid values as well.
* Determine whether a given input is a valid or invalid input and reflect this in your output
* Evaluate the function in OpenOffice
* Evaluate the function in Excel
* Compare the results
* Determine whether the results are sufficiently close
* Summarize the results, across all 100 tests of this function

CONSTRUCT FORMULAS — 20 Points

4. Now test formulas that combine functions from the 10 functions you have used so far.

4a. Create and test 5 interestingly complex formulas. Evaluate them with 100 tests each, as you did for functions in Part 3.

RANDOMLY CONSTRUCTED FORMULAS — 20 Points

5 Now test random formulas using the same 10 functions you have used so far.

5a For 100 test cases, randomly create a formula, and randomly generate VALID input data. From here,

* Evaluate the formula in OpenOffice
* Evaluate the formula in Excel
* Compare the results
* Determine whether the results are sufficiently close
* Summarize the results of these 100 tests

BONUS PROBLEM — 20 Points

6. In questions 4 and 5, you probably discovered that you could supply a function with an input value that was valid, but then the function evaluated to a value that was not valid for the function that took this as input.

For example log (cosine (90 degrees)) is undefined. The initial input (90 degrees) is valid. Cosine evaluates to 0, which is valid, but log(0) is undefined and so cosine(90) is invalid as an input for log.

Describe a strategy that you would use to guarantee that the formula evaluates to a valid, numeric result.