Starting this Fall, I will be teaching Applied Statistics to undergraduate students majoring in Computer Science and in Software Engineering. I am designing the course around real-life examples (published papers, conference talks, blogged case studies, etc.) of the use of probability and statistics in our field. I need examples.

I prefer examples that show an appropriate use of a model or a statistic, but I will be glad to also teach from a few papers that use a model or statistic in a clearly invalid way. If you send me a paper, or a link to one, I would appreciate it if you would tell me what you think is good about it (for a stats student to study) and why you think that.

### Background of the Students

The typical student in the course has studied discrete math, including some combinatorics, and has at least two semesters of calculus. Only some students will have multivariable calculus.

By this point in their studies, most of the students have taken courses in psychology, logic, programming (up to at least Algorithms & Data Structures), two laboratory courses in an experimental science (chemistry, physics, & biology), and some humanities courses, including a course on research sources/systems (how to navigate the research literature).

### My Approach to the Course

In one semester, applied stats courses often cover the concept of probability, discrete distributions, continuous distributions, descriptive statistics, basic inferential statistics and an introduction to stochastic models. The treatment of these topics is often primarily theoretical, with close attention to the underlying theorems and to derivation of the attributes of the distributions and their key statistics. Here are three examples of frequently used texts for these courses. I chose these because you can see their tables of contents on the Amazon site.

- Johnson, Probability and Statistics for Computer Science
- Ross, Introduction to Probability and Statistics for Engineers and Scientists
- Trivedi, Probability and Statistics with Reliability, Queueing, and Computer Science Applications

I’d rather teach a more applied course. Rather than working through the topics in a clear linear order, I’d like to start with a list of 25 examples (case studies) and teach enough theory and enough computation to understand each case.

For example, consider a technical support department receiving customer calls. How many staff do they need? How long will customers have to wait before their call is answered? To answer questions like these, students would learn about the Erlang distributions. I’d want them to learn some of the underlying mathematics of the distribution, the underlying model (see this description of the model for example) and how this practical model connects to the mathematical model, and gain some experience with the distribution via simulation.
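Here is a minimal sketch of the kind of simulation exercise I have in mind. It assumes the simplest textbook version of the call-center model: calls arrive as a Poisson process, call durations are exponential, all agents are interchangeable, and callers wait in a single first-come-first-served queue (an M/M/c queue, the setting of the Erlang C formula). The function names and the rates used in the usage note are made up for illustration; they do not come from any real support department.

```python
import heapq
import math
import random


def erlang_c(arrival_rate, service_rate, servers):
    """Probability that an arriving call must wait (Erlang C formula)."""
    load = arrival_rate / service_rate  # offered load, in Erlangs
    if load >= servers:
        return 1.0  # unstable queue: in the long run every caller waits
    # Sum of load^k / k! for k = 0 .. servers - 1
    partial = sum(load**k / math.factorial(k) for k in range(servers))
    tail = (load**servers / math.factorial(servers)) * servers / (servers - load)
    return tail / (partial + tail)


def mean_wait(arrival_rate, service_rate, servers):
    """Average time in queue, averaged over all callers (stable queue only)."""
    p_wait = erlang_c(arrival_rate, service_rate, servers)
    return p_wait / (servers * service_rate - arrival_rate)


def simulate(arrival_rate, service_rate, servers, n_calls, seed=0):
    """Simulate n_calls and return (fraction that waited, average wait)."""
    rng = random.Random(seed)
    free_at = [0.0] * servers  # min-heap: time at which each agent is next free
    now = 0.0
    waited = 0
    total_wait = 0.0
    for _ in range(n_calls):
        now += rng.expovariate(arrival_rate)   # next call arrives
        earliest = heapq.heappop(free_at)      # agent who frees up first
        start = max(now, earliest)             # call is answered at this time
        if start > now:
            waited += 1
        total_wait += start - now
        heapq.heappush(free_at, start + rng.expovariate(service_rate))
    return waited / n_calls, total_wait / n_calls
```

With hypothetical rates of 2 calls per minute, 1 call handled per agent per minute, and 3 agents, students can compare `erlang_c(2.0, 1.0, 3)` and `mean_wait(2.0, 1.0, 3)` against `simulate(2.0, 1.0, 3, 200_000)`, then change the number of agents and watch how quickly the waiting time responds. The simulation starts with an empty queue, so short runs will understate waiting slightly.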

### Examples Wanted

The biggest complaint that students (at Florida Tech and at several other schools, long ago) have voiced to me about statistics is that they don’t understand how the math applies to anything that they care about now or will ever care about. They don’t see the application to their field or to their personal work. Because of that, I have a strong preference for examples from Computer Science / Software Engineering / Information Security / Computer Information Systems.

Several of my students will also relate well to biostatistics examples. Some will work well with quantitative finance.

In the ideal course (a few years from now), a section of the class that focuses on a specific statistic, a specific model, or a specific type of problem will have links to several papers that cover essentially the same concepts but in different disciplines. Once I have a computing-related primary example of a topic, I can flesh it out with related examples from other fields.

Please send suggestions either as comments on this blog or by email to me. A typical suggestion would include:

- a link to a paper, a presentation, a blog post or some other source that I can reach and read.
- comments from you identifying what part of the material I should teach
- comments from you explaining why this is a noteworthy example
- any other notes on how it might be taught, why it is important, etc.

*Thanks for your help!*

Cem,

My father, William G. Hunter, was an applied statistics professor. He and his mentor and long-time collaborator, George Box, thought that statistics as a subject was, in general, abysmally taught in universities. Their view was very similar to the one you have stated, namely that, as applied statistics is typically taught, students “don’t understand how the math applies to anything that they care about now or will ever care about. They don’t see the application to their field or to their personal work.”

One of my father’s approaches to that challenge was to have his students design statistical experiments and report on their findings. See “101 Ways to Design an Experiment, or Some Ideas About Teaching Design of Experiments” by William G. Hunter, The University of Wisconsin-Madison. Students designed experiments on an extremely wide range of interesting topics that had personal significance to them. http://www.stat.wisc.edu/techreports/tr413.pdf

My father was a modest, gentle man with a reputation for being a very good listener. Despite his soft-spoken personality, he would state at the beginning of his applied statistics courses that his goal was to make the course the most important class his students had ever experienced, giving them new skills that they would find extremely useful throughout their diverse careers. His students could tell he was serious about his boldly-stated intentions.

I’d highly recommend reading George Box’s recent autobiography, An Accidental Statistician: The Life and Memories of George E. P. Box: http://www.amazon.com/dp/1118400887/ref=cm_sw_r_tw_dp_hsQWrb182YKTC In it, interspersed with amusing stories from his life, Box shares a lot of the lessons he learned as he sought to make applied statistics accessible and relevant to the students and practitioners he taught, worked with, and advised.

More thoughts to come… The intersection of applied statistics and software testing is a subject I’m passionate about.

– Justin

I have more ideas to share

Justin:

Thank you. I’ve downloaded the article and ordered the book. I’m sure they’ll be useful.

– cem

I really enjoyed reading a recent white paper by Ali Almossawi that modeled the codebase maintainability of Firefox. The paper is here: http://almossawi.com/firefox/prose/

To add: I think it is valuable for showing students a real-world case involving software they are all most likely familiar with. The paper shows how Ali Almossawi approached the problem of quantitatively modelling code maintainability, as well as what the results were. His writing is clear and concise, even though much of the math was a bit beyond me (but probably not beyond your students, looking at their history). Furthermore, it sheds light on an aspect of software quality that testers generally don’t concern themselves with that much, but that affects them more than they may realize (code stomp, regressions, wacky dependencies). A wide variety of statistics are used, along with many different methods of visual modelling.

In a world where agility, time-to-market, crowd/open sourcing, and rapid prototyping are taking over, it is good for people to be able to keep long-term maintainability in mind.

Whether your students go into development, testing, product management, or many other related fields, it helps to understand how code maintainability affects quality, agility, scalability, cost, etc.

This is a great idea. May I suggest an area of real world application without providing teaching templates?

I did my PhD in the area of Software Engineering (and went on to teach basic mathematics, so I feel a link there :)). Now I am in quality management – and there I am always confronted with the need to generate “numbers” (key performance indicators, testing statistics, …). However, I always get the feeling that the engineers and managers do not really have any idea what those numbers mean. This area MIGHT provide application examples for a statistics course. Sources for material could be quality management courses in classical engineering.

Also, the field of metrics might provide some examples.

(All my own course material is, unfortunately, in German – I’ll think about it and hope to return with more food for thought)

Even if my contribution was not so very elaborate: I appreciate your approach and wish you all the best for it!

What about probabilistic data structures like the cuckoo hash (or others)? Mitzenmacher and Upfal’s “Probability and Computing: Randomized Algorithms and Probabilistic Analysis” has a nice intro to them.

Also, IBM still does orthogonal defect classification (when I worked there I helped clean the data), so it might be interesting to cover that: http://www.ibm.com/developerworks/rational/library/aug06/gu/ and http://www.chillarege.com/articles/odc-concept