Summary: This overview describes the general concept of high volume automated testing (HiVAT) and twelve examples of HiVAT techniques. The common thread of these techniques is automated generation, execution and evaluation of arbitrarily many tests. The individual tests are often weak, but taken together, they can expose problems that individually-crafted tests will miss. The simplest techniques offer high coverage: they can reach a program’s weakness in handling specific special cases. Other techniques take the program through long sequences of tests and can expose programs that build gradually (e.g. memory leaks, stack corruption or memory corruption) or that involve unexpected timing of events. In my experience, these sequences have exposed serious problems in embedded software, especially in systems that involved interaction among a few processors. As we embed more software into more consumer systems, including systems that pose life-critical risks, I believe these tests will become increasingly important.
I am writing this note as an intentionally rough draft. It servers as an introduction for a course on HiVAT at Florida Institute of Technology. It provides a structure for work in the course. It is intentionally under-referenced. One of the students’ tasks in the course is to dig through a highly disjointed literature and map research papers, practitioner papers, and conference presentations to the techniques listed here. Students might also add HiVAT techniques, with associated papers, or add sections on directly-relevant automation-support technology or directly-relevant surveys of test automation strategies / results. I will replace this post with later drafts that are more academically complete as we make progress in the course.
Consider automated regression testing. We reuse a regression test several times–perhaps running it on every build. But are these really automated? The computer might execute the tests and do a simple evaluation of the results, but a human probably designed that test, a human probably wrote the test code that the computer executes, a human probably provided the test input data by coding it directly into the test or by specifying parameters in an input file, a human probably provided the expected results that the program uses to evaluate the test result, and if there appears that there might be a problem, it will be a human who inspects the results, does the troubleshooting and either writes a bug report (if the program is broken) or rewrites the test. All that work by humans is manual.
Notice, by the way, that this interplay of human work and computer work applies whether the tests are run at the system level using a tool like QuickTest Pro, at the unit level using a tool like jUnit, or at a hybrid level, using a tool like FIT.
So, automated regression test are actually manual tests with an automated tint.
And every manual software test is actually automated. When you run a manual test, you might type in the inputs and look at the outputs, but everything that happens from the acceptance of those inputs to the display of the results of processing them is done by a computer under the control of a program–that’s automated.
The difference between “manual” and “automated” is a matter of degree, not a matter of principle. Some tests are more automated, others are less automated, but there is no “all” and there is no “none” for test automation.
High-Volume Tests Focused on Inputs
High-Volume Parametric Variation
We can make the automated regression tests more automated by transforming some of the human tasks to computer tasks. For example, imagine testing a method that takes two inputs (FIRST, SECOND) and returns their sum (SUM). A typical regression test would include specific values for FIRST, SECOND, and EXPECTED_SUM. But suppose we replace the specific test values with a test data generator that supplies random values for FIRST and SECOND and calculates the value of EXPECTED_SUM. We can generate billions of tests this way, each a little different, with almost no increase in human effort.
This is one of the simplest examples of high volume automated testing. A human supplied the algorithm, but the test tool applies that algorithm to create, run, and evaluate the results of arbitrarily many tests.
In this example, those tests are all pretty similar. Why bother running them all? The traditional answer is “don’t bother.” Domain testing is most widely used software testing technique. The point of domain testing is to help the tester minimize test redundancy by selecting a small number of inputs to represent the larger set of possibilities. Usually, this works well. Occasionally, a program has a problem with a small number of specific inputs. For example, the program might be optimized in a way that processes a few specific values specially. Or it might be vulnerable to small calculation errors that are usually too small to notice, but occasionally have a visible effect (see Doug Hoffman’s report at http://www.testingeducation.org/BBST/foundations/Hoffman_Exhaust_Options.pdf for an example.
High-Volume Combination Testing
Rather than thinking of testing a single function that takes a few inputs, imagine something more complex. A program processes several inputs, perhaps with a series of functions, and reports an output. Again, the traditional approach is to minimize redundancy. When we do combination tests (test several variables together), the set of possible tests is usually huge. If the variables are independent, we can minimize the number of combination tests with combinatorial testing (e.g. all-pairs or all-triples). If the variables are related, we can use cause-effect graphing instead. But if we suspect that the program will fail only on a small number of specific combinations (and we don’t know which few those are), we have to test a larger set of combinations, generating input values and calculating the expected result for each combination.
There are different strategies for generating the input values:
- Exhaustive sampling. This tests all the combinations but the set might be impossibly large.
- Random sampling. Generate random values for the inputs, stopping when some large number have been tested.
- Optimized sampling. Use an algorithm that optimizes the set of combinations in some way. As a very simple example, if you are going to test a million combinations, you could divide the space of combinations into a million same-size, non-overlapping subsets and sample one value of each. Or you could use a sequential algorithm that assigns values for the next combination by creating a test that is distant (according to some distance function) from all previous tests.
So far, I’ve presented tests that generate both, the inputs and the expected value of the test. The expected value serves as an oracle, a mechanism for deciding whether the program passed or failed the test. Test oracles are incomplete, but they are useful for automation. Even if we can’t detect all possible failures, we can detect any failures of a certain kind (such as a calculation error).
“Fuzzing” refers to a family of high-volume automated tests that vary the inputs but have no oracle. The test runs until the program crashes or fails in some other unmissable way.
Hostile Data Stream Testing
Alan Jorgensen tested for many types of security errors by taking a good file in a standard format (e.g. PDF) and corrupting by substituting one string in the file with another. In Jorgensen’s work, the new string was typically much longer or was syntactically different. Jorgensen would then open the now-corrupt file with the application under test. The program might reject the file as corrupt or accept it. If it accepted the file, Jorgensen could ask the program to use the file in some way (e.g. display it) and look for misbehavior. In some cases, Jorgensen could exploit a failure to recognize a corrupt file by embedding executable code that the program would then inappropriately execute. Jorgensen’s search for corruptions that would escape detection was an intensely automated activity. His analysis of exploitability was not.
There are several published variations of file fuzzing (variation of the contents or format of an input file) that are suited for different kinds of risks.
High-Volume Tests that Exploit the Availability of an Oracle
Having an oracle gives you a strong basis for high-volume automated testing. All that’s required is to generate inputs to the program under test and to drive the program to process the inputs. Given a result, you can use the oracle to check whether the program passed the test. No oracle is perfect: the program might fail the test even though its output matches the oracle’s. But even though you can’t learn everything that might possibly be interesting by relying on an oracle, you have a basis for running a boatload of tests that collectively check thoroughly for some types of failures and give you an opportunity to stumble onto some other types of failures (e.g. crashes from memory leaks) that you didn’t design the tests to look for. For any other ways that you can imagine the program might fail, you can design tailored manual tests as needed.
High-volume testing with an oracle won’t completely test the program (against all possible risks), but it will provide you with a level of coverage that you can’t achieve by hand.
Function Equivalence Testing
Function equivalence testing starts with an oracle–a reference function that should work the same way as the function under test. Given the reference function, you can feed inputs to the function under test, get the results, and check whether the reference function gives the same results. You can test with as many input values as you want, perhaps generating a large random set of inputs or (if you have enough available computer time) testing every possible input.
The final exam in my Programmer-Testing course illustrates function equivalence testing. The question that students have to answer is whether Open Office Calc does its calculations the same way as Microsoft Excel. We adopt a quality criterion: If Calc does it the same way as Excel, that’s good enough, even if Excel makes a few calculation errors.
To answer the question, the students:
- test several functions individually
- then test those functions together by testing formulas that combine several functions
To do this,
- The students pick several individual functions in Calc
- They test each by feeding random inputs to the Calc function and the same inputs to the corresponding Excel function.
- Then they create random formulas that combine the functions, feed random data to the functions in the formula, and compare results.
If you test enough inputs, and the Calc results are always the same as Excel’s (allowing a little rounding error), it is reasonable to conclude that the calculation in Calc is equivalent to the calculation in Excel.
We use a constraint oracle to check for impossible values or impossible relationships.
For example an American ZIP code must be 5 or 9 digits. If you are working with a program that reads (or looks up or otherwise processes) ZIP codes, you can check every code that it processes. If it accepts (treats as a ZIP code) anything that has non-numeric characters or the wrong number of characters, then it has a bug. If you can find a way to drive the program so that it reads (or does whatever it does with) lots of ZIP codes, you have a basis for a set of high-volume automated tests.
Imagine taking a list that is sorted from high to low, sorting it low to high, then sorting it back (high to low). If you can give this program enough lists (enough sizes, enough diversity of values), you can eventually conclude that it sorts correctly (or not). Any operation that you can invert, you can build a high volume test series against.
State-Model Based Testing (SMBT)
If you have a state model (including a way to decide where the program should go if you give it an input and a way to determine whether the program actually got there), you can feed the program an arbitrarily long sequence of inputs and check the results.
I think the most common way to do SMBT is with a deterministic series of tests, typically the shortest series that will achieve a specified level of coverage. The typical coverage goal is every transition from each possible state to every state that it can reach next. You can turn this into a high-volume series by selecting states and inputs randomly and running the sequence for an arbitrarily long time. Ben Simo cautions that this has to be monitored because some programs will get locked into relatively tight loops, never reaching some states or some transitions. If you write your test execution system to check for this, though, you can force it out of the loop and into a not-yet hit state.
I worked at a company that designed telephone systems (Telenova). Our phones gave customers a menu-driven interface to 108 voice features and 110 data features. Imagine running a system test with 100 phones calling each other, putting each other on hold, transferring calls from one phone to another, then conferencing in outside lines, etc. We were never able to create a full state model for our system tests. They were just too complex.
Instead, the Telenova staff (programmers, testers, and hardware engineers) designed a simulator that could drive the phones from state to state with specified or random inputs. They wrote probes into the code to check whether the system was behaving in unexpected ways. A probe is like an assert command, but if the program triggers the probe, it logs an error instead of halting the program. A probe might check whether a variable had an unexpected value, whether a set of variables had unexpected values relative to each other, whether the program went through a series of states in an unexpected order, etc.
Because these probes checked the internal state of the system, and might check any aspect of the system, we called them diagnostics.
Implementing this type of testing required a lot of collaboration. The programmers wrote probes into their code. Testers did the first evaluations of the test logs and did extensive troubleshooting, looking for simple replication conditions for events that showed up in the logs. Testers and programmers worked together to fix bugs, change the probes, and specify the next test series.
As challenging as the implementation was, this testing revealed a remarkable number of interesting problems, including problems that would have been very hard to find in traditional ways but had the potential to cause serious failures in the field. This is what convinced me of the value of high-volume automated testing.
High-Volume Tests that Exploit the Availability of Existing Tests or Tools
Sometimes, the best reason to adopt a high-volume automated technique is that the adoption will be relatively easy. That is, large parts of the job are already done or expensive tools that can be used to do the job are already in place.
Long-Sequence Regression Testing (LSRT)
For example, Pat McGee and I wrote about a well-known company that repurposed its huge collection of regression tests of its office-automation products’ firmware. (In deference to the company’s desire not to be named, we called it Mentsville.)
When you do LSRT, you start by running the regression tests against the current build. From that set, pick only tests that the program passes. Run these in random order until the program fails (or you’ve run them for a long enough time). The original regression tests were designed to reveal functional problems, but we got past those by using only tests that we knew the program could pass when you ran them one-at-a-time. The bugs we found came from running the tests in a long series. For example, the program might run a series of 1000 tests, thirty of them the same as the first (Test 1), but it might not fail Test 1 until that 30th run. Why did it fail on time 30 and not before?
- Sometimes, the problem was a gradual build-up of bad data in the stack or memory.
- Sometimes, the problem involved timing. For example, sometimes one processor would stay busy (from the last test) for an unexpectedly long time and wouldn’t be ready when it was expected for this test. Or sometimes the firmware would expect a location in memory to have been updated, but in this unusual sequence, the process or processor wouldn’t yet have completed the relevant calculation.
- Sometimes the problem was insufficient memory (memory leak) and some of the leaks were subtle, requiring a specific sequence of events rather than a simple call to a single function.
These were similar to the kinds of problems we found at Telenova (running sequences of diagnostics-supported tests overnight).
Troubleshooting the failures was a challenge because it was hard to tell when the underlying failure actually occurred. Something could happen early in testing that wouldn’t cause an immediate failure but would cause gradual corruption of memory until hours later, the system crashed. That early result was what was needed for replicating the failure. To make troubleshooting easier, we started running diagnostics between tests, checking the state of memory, or how long a task had taken to execute, or whether a processor was still busy, or any of over 1000 other available system checks. We only ran a few diagnostics between tests. (Each diagnostic changed the state of the system. We felt that running too many diagnostics would change the state too much for us to get an accurate picture of the the effect of running several regression tests in a row.) But those few tests could tell us a great deal. And if we needed more information, we could run the same 1000-test sequence again, but with different diagnostics between tests.
High-Volume Protocol Testing
A protocol specifies the rules for communication between two programs or two systems. For example, an ecommerce site might interact with VISA to bill a customer’s VISA card for a purchase. The protocol specifies what commands (messages) the site can send to VISA, what their structure should be, what types of data should appear in the messages and where, and what responses are possible (and what they mean). A protocol test sends a command to the remote system and evaluates the results (or sends a related series of commands and evaluates the series of responses back).
With popular systems, like VISA, lots of programs want to test whether they work with the system (whether they’ve implemented the protocol correctly and how the remote system actually responds). A tool to run these types of tests might be already available, or pieces of it might be available. If enough is available, it might be easy to extend it, to generate long random sequences of commands to the remote system and to process the system’s responses to each one.
Load-Enhanced Functional Testing
Common lore back in the 1970’s was that a system behaved differently, functionally differently, when it was running under load. Tasks that it could perform correctly under “normal” load were done incorrectly when the system got busy.
Alberto Savoia described his experience of this in a presentation at the STAR conference in 2000. Segue (a test automation tool developer) built a service around it. As I understand their results, a system could appear 50%-busy (busy, but not saturated) but still be unable to correctly run some regression tests. The value of this type of testing is that it can expose functional weaknesses in the system. For example, suppose that a system running low on memory will rely more heavily on virtual memory and so the timing of its actions slows down. Suppose that a program running on the system spreads tasks across processors in a way that makes it vulnerable to race conditions. Suppose that Task 1 on Processor 1 will always finish before Task 2 on Processor 2 when the system is running under normal load, but under some heavy loads, Processor 1 will get tied up and won’t process work as quickly as Processor 2. If the program assumes that Task 1 will always get done before Task 2, that assumption will fail under load (and so might the program).
In the field, bugs like this produce hard-to-reproduce failures. Unless you find ways to play with the system’s timing, you might never replicate problems like this in the lab. They just become mystery-failures that a few customers call to complain about.
If you are already load-testing a program and if you already have a regression suite, then it might be very easy to turn this into a long-sequence regression test running with a moderate-load load test in parallel. The oracles include whatever comes with each regression test (to tell you whether the program passed that test or not) plus crashes, obvious long delays, or warnings from diagnostics (if you add those in).
A Few More Notes on History
I think that most of the high-volume work has been done in industry and not published in the traditional academic literature. For example, the best-known (best-publicized) family of high-volume techniques are “fuzzing”, attributed to Professor Barton Miller at the University of Wisconsin in 1988.
- But I was seeing this industrial demonstrations of this approach when I first moved to Silicon Valley in 1983, and I’ve been told of applications as early as 1966 (the “Evil” program at Hewlett-Packard).
- Long-sequence regression testing was initially developed in 1984 or 1985. I played a minor role in developing Telenova’s diagnostics-based approach in 1987; my impression was that we were applying ideas already implemented years ago at other telephone companies.
- The testing staff at WordStar (a word-processing company) used a commercially available tool to feed long sequences of tests from one computer to another in 1984 or 1985. The first computer generated the tests and analyzed the test results. The second machine ran the program under test (WordStar). They were hooked up so that commands from the first machine looked like keyboard inputs to the second, and outputs that the second machine intended for its display actually went to the second machine for analysis. As an example of the types of bugs they found with this setup, they were able to replicate a seemingly-irreproducible crash that turned out to result from a memory leak. If you boldfaced a selection of text and then italicized it, there was a leak. If you italicized first, then applied bold, no leak. The test that exposed this involved a long random sequence of commands. I think this was a normal way to use that type of tool (long sequences with some level of randomization).
- We also used random input generators in the telephone world. I think of a tool called The Hammer but there were earlier tools of this class. Hammer Technologies was formed in 1991 and I was hearing about these types of tools back in the mid-to-late-1980’s while I was at Telenova.
I’ve heard of related work at Texas Instruments, Microsoft, AT&T, Rolm, and other telephone companies, in the testing of FAA-regulated systems, at some auto makers, and at some other companies. I’m very confident that the work actually done is much more broadly spread than this small group of companies and includes many other techniques than I’ve listed here. However, most of what I’ve seen has been by semi-private demonstrations or descriptions or via demonstrations and descriptions at practitioner conferences. Most of the interesting applications that I have personally heard of have involved firmware or other software that controls hardware systems.
As far as I can tell, there is no common vocabulary for these techniques. Much of what has been published has been lost because many practitioner conference proceedings are not widely available and because old papers describing techniques under nonstandard names are just not being found in searches. I hope that we’ll fill in some of these gaps over the next several months. But even for the techniques that we can’t find in publications, it’s important to recognize how advanced the state of the practice was 25 years ago. I think we are looking for a way to make sometimes-closely-held industrial practices more widely known and more widely adopted, rather than inventing a new area.
This post is partially based on work supported by NSF research grant CCLI-0717613 ―Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing. Any opinions, findings and conclusions or recommendations expressed in this post are those of the author and do not necessarily reflect the views of the National Science Foundation.