Or, as a good subtitle: “the assertion that metrics-based software development is stupid.”

I’ve got a nice article somewhere in an issue of Communications of the ACM about software metrics, and about how using those metrics to drive development or evaluate performance is generally a bad idea. Generally. Not always. Much of it comes down to “gaming the system”, which I suppose I had best define. The “system” is the combination of a process, some measurement on that process, and a rewards scheme based on the measurements; “gaming” is the practice of performing actions that produce favorable measurements, independent of whether those actions serve the overall goal of the system. You can see this happen in educational circles: when the system measures, say, student grades and financing is granted based on those grades, there is strong pressure to raise grades and so obtain more financing, regardless of the grades the students actually deserve.

There’s probably more on that in Freakonomics. That’s a pretty good book. Heck, there might even be more on it in Wikipedia.

But turning to software development: we produce software. But the word “produce” there is kind of weird, since software is intangible. When you’re producing cars or corn you can count the physical output; you can look at a partial product and have a good idea of how much effort is needed to complete the product. Not so with software. How can you count products?

This is why we end up counting not the product itself, but derivatives of it: commits, lines of code, etc. But because none of these relate directly to the goal of the process, namely the software, we open the door to gaming the system.

When such circumstances arise as to prompt the financial remuneration, compensation, consideration and recompense of a writer, scribe or author’s efforts on the written page and in print it may be taken into consideration that a method of measurement of said activities and products which recompenses verbosity and not clarity of thought and expression may lead inadvertently and accidentally to over-wordy, lengthy, tedious and altogether annoying prose. Or: paying authors by the word means they’ll write longer sentences.

In our little software world, paying by lines of code or commits is a bad idea, because it will prompt people to start producing more lines of code or commits regardless of whether this is good for the overall purpose of the process, i.e. good for producing the end software product. Here’s a good way to prop up your productivity, as a shell function:

commit() {
    # Find every file that is locally added, modified or deleted...
    for i in `svn st | grep '^[AMD]' | awk '{print $2}'`
    do
        # ...and commit each one on its own, all with the same message.
        svn commit -m "$1" "$i"
    done
}

Instead of committing a bunch of files at once, you commit each one individually (all with the same commit message). Suddenly one atomic commit becomes five! That’s a five-fold productivity gain right there.

Waaaait a minute, you say. That’s not productivity! Well, not in the common sense of the word, but within this system, which pays by the commit, it is. And so you game the system. The same goes for lines of code, and much effort has been put into “correcting” lines-of-code counts to make them a more accurate indicator of overall productivity.

But it remains an indicator, no more. You can always do manual loop unrolling if you need more money within a system that pays for SLOC.
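
To make that concrete, here’s a little sketch in the same vein as the shell function above; process_chunk is a hypothetical stand-in for real work, and only the line count matters:

# Looped: four lines of code, no matter how many chunks there are.
for i in 1 2 3 4 5 6 7 8
do
    process_chunk "$i"
done

# Manually unrolled: identical behavior, double the SLOC (and pay).
process_chunk 1
process_chunk 2
process_chunk 3
process_chunk 4
process_chunk 5
process_chunk 6
process_chunk 7
process_chunk 8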

One of the few things that can do a reasonable job of measuring software productivity is function point analysis. Based on lots of experience in managing software, you chop the required functionality into function points, assign an amount of time to each, and developers tick off the function points as they complete them. This does a pretty good job of measuring productivity, but even in commercial software development there is about an order of magnitude difference in function points per hour between different projects within the same organization (again, from memory, out of a CACM issue). Closer analysis of the numbers is always needed if you want to steer future development based on them. And let’s face it, the whole premise is hopelessly flawed in the anarchic, community-based development that characterizes KDE.
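
The arithmetic itself is trivial; a minimal sketch, assuming a hypothetical fp-log.txt that records each completed item as a “points hours” line:

# Tally completed function points and the hours spent on them.
awk '{ fp += $1; hours += $2 }
     END { printf "%.2f function points per hour\n", fp / hours }' fp-log.txt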

So we get back to counting the things that we can count. Because developers do produce lines of code and commits. If that’s all that is countable, then – here the old adage about hammers and nails applies – that’s what you count.

Note that within a project (especially a gigantic umbrella project like KDE) I will never suggest that “more productive” (based on what we can measure) means “better” or “more useful”. That’s because productivity here is based on something derivative of the actual useful effort put into the project – the purpose is to produce cool software, and we can’t measure that effectively. So for the sprints considered recently – SQO, Plasma, KDevelop, and the KDEEdu sprint Anne-Marie mentions – we can crunch the numbers for this measure of productivity and draw a graph. But none of that says anything about the purpose, use or success of such sprints, except within the system we’re working in.

We could – and the KDE e.V. does, in a way, by encouraging each sprint or developer meeting to introduce new people to the project – use a different measure of productivity for sprints. We could count the number of person-pairs (a,b) physically present at a sprint that have never previously been in one room together. The SQO sprint scores 0 on that count, Plasma considerably higher. We can game that system by bringing in random hobos from the street.
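
That measure is easy enough to compute, too. A sketch, assuming a hypothetical roster.txt (one attendee per line, no spaces in names) and a seen-pairs.txt recording previously co-present pairs as alphabetically ordered “a b” lines:

# Count attendee pairs who have never shared a room at a sprint before.
novel_pairs() {
    sort roster.txt > /tmp/roster.sorted
    # Emit every unordered pair from the sorted roster (alphabetical
    # within each pair, so the lines match seen-pairs.txt)...
    awk '{ name[NR] = $0 }
         END { for (i = 1; i < NR; i++)
                   for (j = i + 1; j <= NR; j++)
                       print name[i], name[j] }' /tmp/roster.sorted | sort > /tmp/pairs.sorted
    sort seen-pairs.txt > /tmp/seen.sorted
    # ...and count the pairs that are not in the historical record.
    comm -23 /tmp/pairs.sorted /tmp/seen.sorted | wc -l
}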

So whenever you are measuring derivatives of a process and not the actual product or goal of that process – and I will argue that for Free Software development this is the situation we have to live with – the system can be gamed, and you have to live with that too.

So where does this leave us with SQO-OSS? We are measuring stuff about things. The things being Open Source software projects, the stuff being, well… being stuff. And the stuff includes numbers about the process, such as the number of commits, active SVN accounts, growth in SLOC, and the number of reported bugs. It’s really important to remember that the raw numbers are not comparable between projects, or even between different parts of (umbrella) projects. I’m aware that bad comparisons will inevitably follow (case in point: introduce a metric that counts the letter F in source code – sketched below – and conclude that fvwm is better than postgres because the numbers are bigger). Here careful scientific study – and that doesn’t happen on blogs, but in papers and at conferences – is needed to amalgamate and normalize the numbers into something that is at least somewhat comparable.
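
For the record, that F metric would be a one-liner, which is exactly what makes it so tempting and so useless (src/ being whatever source tree you point it at):

# The "F metric": count occurrences of the letter F in the source tree.
# Trivially computable, trivially gameable, meaningless to compare.
grep -roh 'F' src/ | wc -l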

There’s a European project – not SQO-OSS – to bring together a number of existing Open Source quality models. That’s the kind of long-term research that is needed to be able to do a qualitative comparison between software products, and we expect that SQO-OSS will deliver the kind of numbers needed to evaluate such a shared quality model. However, qualitative assessment will be needed for a long time to come, to tune the models and validate their results.

So, the short summary: using the word “productivity” to denote one measurable derivative of the process that is sprinting is confusing, but in the world of software metrics we can’t do much better than that without a much firmer grasp of the quantifiable, measurable goals of such processes.

And, as an addendum: I have a SQO-OSS T-shirt that says “commit”. It’s because I do the most commits in SQO-OSS. Am I the most productive? Yes, says the metric. Am I the most useful person in the project? Who knows. We’re all in this together. But looking at the qualitative aspects of my commit behavior, I’m into small atomic commits much more than the others in the project. I will happily commit typo fixes every half hour while going over a document; others commit the whole document in one go. That’s a workflow difference that games the system even when it’s not intentional (or profitable, since I’m not paid per commit).