A few weeks ago, Paul Adams (one of the folks whose innuendo I shall miss most in two weeks' time) wrote about his newest oracle, which calculates something based on SVN logs. This Oracle of ERVIN expresses a notion of connectedness in a group of developers working on shared resources.
That Oracle (for once in my blogging, not that database company), which is related to Erdös numbers and to Hirsch indexes, reminds me of some work I once did on software quality, where we defined an "eyeball number": the number of distinct developers who worked on a given file. By fiddling around with the eyeball number, modified by other factors, we hoped to get some interesting indicator of software quality.
Indicator is the operative word here.
I think I can sum up (ha!) Paul's Oracle of ERVIN as follows: given a set of developers D0...Dn and a set of shared resources (e.g. files), create a graph with nodes D0...Dn and arcs (Di,Dj) if both Di and Dj have modified a resource R in the time period under consideration. If the resulting graph is connected, carry on. Otherwise, bail out with "OMG this community is not connected but rather like grains of loose sand!" Or perhaps redefine the set of developers to make the community connected.
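Sketched in Python, that construction might look something like this (a minimal sketch of my own, assuming the SVN log has already been boiled down to (developer, resource) pairs; build_graph and is_connected are names I made up, not Paul's):

    from collections import defaultdict
    from itertools import combinations

    def build_graph(commits):
        # commits: iterable of (developer, resource) pairs extracted from the SVN log
        touched = defaultdict(set)          # resource -> developers who modified it
        for dev, resource in commits:
            touched[resource].add(dev)
        # developer -> set of developers who share at least one resource with them
        graph = {d: set() for devs in touched.values() for d in devs}
        for devs in touched.values():
            for a, b in combinations(devs, 2):
                graph[a].add(b)
                graph[b].add(a)
        return graph

    def is_connected(graph):
        # Walk the graph from an arbitrary developer; the community is
        # connected iff every other developer is reachable.
        if not graph:
            return True
        start = next(iter(graph))
        seen, todo = {start}, [start]
        while todo:
            node = todo.pop()
            for other in graph[node]:
                if other not in seen:
                    seen.add(other)
                    todo.append(other)
        return len(seen) == len(graph)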
Now that we have a connected graph, define Dij as the hop count from node Di to node Dj (clearly Dii=0, and Dij=1 iff an arc exists, i.e. they both modified the same software artifact). Finally, define the connectedness Ci of developer Di as the average of Dij for all j.
(Drat Wordpress and its lack of LaTeX support, btw)
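Since Wordpress won't render it as a formula, here is the same definition as a Python sketch instead, operating on the adjacency map from the sketch above (again, hop_counts and connectedness are names of my own invention):

    def hop_counts(graph, start):
        # Breadth-first search: returns {Dj: hop count Dij from start to Dj}.
        dist, frontier = {start: 0}, [start]
        while frontier:
            nxt = []
            for node in frontier:
                for other in graph[node]:
                    if other not in dist:
                        dist[other] = dist[node] + 1
                        nxt.append(other)
            frontier = nxt
        return dist

    def connectedness(graph, dev):
        # Ci: the average of Dij for all j (taking "all j" literally, Dii=0 is included).
        dist = hop_counts(graph, dev)
        return sum(dist.values()) / len(dist)

As a made-up example: with three developers where ade and paul share one file and paul and stuart share another, Ci for ade works out to (0 + 1 + 2) / 3 = 1.0.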
After all that, we have an indicator of ... something. If someone has a Ci close to 1, then that person has worked on a resource R that nearly every other person Dj has worked on (R may vary with Dj).
There is an obvious way to game this system, which is called CHANGELOG. Of course everyone maintains that file, so in theory all developers work on it and Dij=1 for everyone. Let's ignore that resource, then; there are plenty of heuristic ways to tighten up the set of resources R so that we get something more meaningful (one is sketched below).
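For instance, a crude filter (a sketch only; the 0.8 threshold is an arbitrary choice of mine) would drop any resource that most of the developers have touched before building the graph:

    from collections import defaultdict

    def filter_resources(commits, max_fraction=0.8):
        # Drop resources (CHANGELOG and friends) that more than max_fraction
        # of all developers have modified, before building the graph.
        touched, all_devs = defaultdict(set), set()
        for dev, resource in commits:
            touched[resource].add(dev)
            all_devs.add(dev)
        cutoff = max_fraction * len(all_devs)
        return [(dev, res) for dev, res in commits if len(touched[res]) <= cutoff]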
Ah, but what does it mean?
Or rather: what does it indicate?
I think the only place Paul went off the rails is in calling Ci a measure of collaboration and using it to identify core contributors. Well, it might indicate (point at) some people as core contributors, but interpretation and further study - further data - are needed. So giving this indicator the right name is of some importance.
Aside #1: if I am a terrible coder and keep committing atrocities in the codebase and all other contributors fix up my abominations, then I might end up with a Ci (i=adridg) of 1. But then I would be the core of the problem, not the solution. Interpretation is needed - you may have to squint.
Aside #2: at work I was dealing with quality indicators (waiting times for patients and throughput for various conditions) and ran into problems similar to those with this indicator. What it's called matters. How it is presented matters. Formulating the definitions in a way that is both understandable and valid is tricky. Get any of them wrong and you end up with plenty more long discussions and repeated rounds of definition.
So where is all this headed? Certainly it illustrates that single-metric definitions rarely yield unambiguous results for concepts as ambiguous as "collaboration" or "core contributor". This touches on the ALERT project (mentioned on the Dot and by Stuart). The system being designed there is about "stuff that's happening that is interesting for me", collected, filtered and presented automatically. There is no single metric to do that - rather, clustering and heuristic approaches are needed. As the project goes on, there will be half-baked illustrations (mixed metaphor: should be scribbles) of what it finds. What I take away from Paul's indicator is that presentation and interpretation are going to be very important. Also that critiquing an indicator is an art in itself.
So there you have it: tl;dr (length metric=N, formula count=4, boringosity=17.6), and what it means is that the marketing of metrics matters.