One last blog entry on metrics; so far I’ve received a lot of useful comments. It’s time to restate my position on metrics (as a guy who works on software metrics a fair bit): raw measurements, published without sufficient context, cause confusion and invite unfounded comparisons. That’s just because of how people see numbers: that apple scores 3, this pear scores 5 (I guess one is a Braeburn and the other a Doyenne de Comice). The unfounded comparison is that the pear is better than the apple.

Which brings us to the notion of comparability. Basically, the thing to note is that you must compare apples to apples. There’s a blog post by Alex about LOC and an insightful comment by Zed Shaw exhorting us to compare only where comparison makes sense; context is crucial in such matters. (In Alex’s blog post, this becomes clear in the comments thread.)

Case in point: comparing the LOC of program A and program B. I think you can sensibly claim that there is a relationship between LOC and maintenance effort; if A and B are written in the same language by similar programmers for similar functionality (how to establish that is another matter), then you might conclude that A is easier to maintain than B if A’s SLOC count is smaller than B’s. Might be, because of corner cases such as Perl one-liners. You can’t even translate SLOC(A) > SLOC(B) to mean “A is bigger than B” in any meaningful way: but the mere existence of such numbers is going to invite exactly that comparison.
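For illustration, here is a minimal sketch of the kind of naive line counter such numbers usually come from; the counting rules (skip blank lines, one comment prefix, no block comments or continuations) are my assumptions, and they already show how arbitrary the resulting number is.

```python
# A naive SLOC counter: count non-blank, non-comment lines in a file.
# The rules here (blank lines skipped, a single comment prefix, no handling
# of block comments or line continuations) are assumptions for illustration.
def sloc(path, comment_prefix="#"):
    count = 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            stripped = line.strip()
            # Skip blank lines and full-line comments; everything else counts.
            if stripped and not stripped.startswith(comment_prefix):
                count += 1
    return count

# A Perl one-liner that replaces dozens of lines elsewhere still scores 1 here,
# so a smaller count does not automatically mean "easier to maintain".
```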

Similarly, the absolute number of bugs reported invites a comparison that is meaningless without context; for Free Software projects that context includes the project’s culture of reporting bugs, which makes it rather hard to adjust for. But Gartner is (or used to be) pretty good at playing the “X has more bugs than Y” game, in a pomegranate-and-grapefruit sense.

But now let’s turn to apples and apples. The gist of Zed’s comment is that it does make sense to watch metrics over time, because we may assume that today’s A is quite similar to yesterday’s A (this goes for developers, KDE subprojects, etc.). This is a fair comparison: if I do 10 commits today and did 2 yesterday, I might be more productive today than yesterday. This assumes that the kind of commits I do doesn’t change from day to day, and we’ve already discussed the tenuous relationship between commits and actual productivity, but still: all else being equal, 10 is better than 2. Watching metrics over time also alerts us to sudden changes. If my commit rate falls to 0 and my mailing rate falls to 0, then possibly I’m on vacation. In a more knotty fashion: if the cyclomatic complexity of some piece of code suddenly goes up, we can expect that it will require more testing, or possibly some refactoring and algorithmic improvement in the near future, or that the code has changed in some fundamental way (new features or something). Again, we need more context to determine whether this is good or bad.
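To make the “sudden changes” idea concrete, here is a minimal sketch of flagging discontinuities in a per-week metric series; the trailing-average baseline and the factor-of-two threshold are my assumptions, not anything the KDE tooling actually does.

```python
# Flag points in a metric series (weekly commits, cyclomatic complexity, ...)
# that jump or drop sharply relative to the average of the preceding values.
# Window size and threshold factor are arbitrary assumptions for illustration.
def discontinuities(series, window=4, factor=2.0):
    flagged = []
    for i in range(window, len(series)):
        baseline = sum(series[i - window:i]) / window
        if baseline == 0:
            # Any activity after a flat-zero stretch is itself a discontinuity.
            if series[i] > 0:
                flagged.append((i, series[i]))
            continue
        ratio = series[i] / baseline
        if ratio >= factor or ratio <= 1.0 / factor:
            flagged.append((i, series[i]))
    return flagged

# Example: weekly commit counts. The jump to 12, the drop to 0, and the
# recovery afterwards all get flagged -- whether any of that is "good" or
# "bad" still needs context (vacation? release crunch? big refactor?).
print(discontinuities([3, 4, 2, 3, 12, 3, 3, 0, 0, 3]))
```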

… or do we?

Most likely “good” and “bad” are just the wrong words to be using. And I’m talking like a manager who is watching the process of software development like an ant farm – they’re doing something, but I don’t understand what or why. So let’s stop with that.

The reason I actually do source code metrics is that they can show developers something they may not even be aware of; discontinuities in the graphs are interesting rather than good or bad, and it’s best if the developer reflects on what happened to see if there are any quality implications in the discontinuity. For some metrics, it’s just nice to be able to point to distinguished events (like the green-blobs graph, which shows developer activity per week; you can point out where coolo got married). Metrics can be employed to give developers an idea of where effort is needed, or – through counting simple errors – a place to get started.
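The aggregation behind an activity-per-week view like the green-blobs graph is simple enough to sketch; the input format here (author and commit timestamp pairs) is an assumption for illustration, not what the actual tool consumes.

```python
# Bucket commits by (author, ISO week) -- the kind of aggregation behind a
# per-developer, per-week activity view. Input format is assumed for the sketch.
from collections import Counter
from datetime import datetime

def activity_per_week(commits):
    """commits: iterable of (author, datetime) pairs.
    Returns a Counter keyed by (author, (iso_year, iso_week))."""
    buckets = Counter()
    for author, when in commits:
        iso = when.isocalendar()
        buckets[(author, (iso[0], iso[1]))] += 1
    return buckets

# Hypothetical data: a blank week in someone's row is a distinguished event
# you can ask about, not a verdict on their productivity.
commits = [
    ("ade", datetime(2008, 6, 2)),
    ("ade", datetime(2008, 6, 4)),
    ("coolo", datetime(2008, 6, 3)),
]
print(activity_per_week(commits))
```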

In the end, within the KDE project – this is a reasonably apples-to-apples environment – the old adage of the EBN applies: if the numbers go down, our quality has gone up. When the numbers go up, we’re introducing new code. And when the numbers go up drastically, we’ve probably introduced a new tool to count a new class of errors that weren’t on the agenda previously.
