
Monday, November 28, 2011

A serious take on Data-Rich Development - Part 2

In the last part I explained how I prepared the histogram of class changes across issues. That was certainly not everything you can do with the data I generated. The next step was to get more detailed information on each class. I definitely wanted to see how the number of modifications changed over time, so I prepared the chart you can see below. It shows how many issues “changed” a class in a given month (a month being the resolution; see footnote 1):

Note that in this case I decided to use a standard line chart, since the information I wanted to present is not really that complex. I certainly don’t want the “cool charts” to become my golden hammer. It is also worth noting that after using this data for some time I decided to normalize the Y-axis, so that I can compare the metric across different classes at a glance. See for instance the second example below, where you can see that over time the number of both modifications and defects decreased. The trend is not as sharp as on the stock market, but once you look at a number of these charts (the one above being an example), its existence becomes obvious.
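For illustration, here is a minimal Python sketch of the counting behind a chart like this. It is not the tooling used for the charts in this post: the input format (tuples of issue id, class name and change date), the function names and the sample data are all assumptions made up for the example. The shared Y limit is just one possible reading of "normalizing" the axis, namely giving every per-class chart the same upper bound.

```python
from collections import defaultdict
from datetime import date


def issues_per_month(changes):
    """Count distinct issues that touched each class in each month.

    `changes` is a hypothetical input format: an iterable of
    (issue_id, class_name, change_date) tuples.
    """
    issues = defaultdict(set)  # (class_name, (year, month)) -> issue ids
    for issue_id, class_name, changed_on in changes:
        issues[(class_name, (changed_on.year, changed_on.month))].add(issue_id)

    counts = defaultdict(dict)  # class_name -> {(year, month): issue count}
    for (class_name, month), issue_ids in issues.items():
        counts[class_name][month] = len(issue_ids)
    return dict(counts)


def shared_y_limit(counts):
    """Largest monthly count over all classes (assumed meaning of a
    normalized Y-axis: every chart uses this value as its maximum)."""
    return max(
        (n for per_month in counts.values() for n in per_month.values()),
        default=0,
    )


# Made-up sample data, for illustration only.
changes = [
    ("ISSUE-1", "OrderService", date(2011, 9, 3)),
    ("ISSUE-1", "OrderService", date(2011, 9, 4)),  # same issue, counted once
    ("ISSUE-2", "OrderService", date(2011, 9, 20)),
    ("ISSUE-3", "OrderService", date(2011, 10, 1)),
    ("ISSUE-3", "Invoice", date(2011, 10, 1)),
]

monthly = issues_per_month(changes)
y_max = shared_y_limit(monthly)
```

Each entry of `monthly` is then one line chart (months on the X-axis, issue counts on the Y-axis), and `y_max` can be passed to whatever charting library you use as the common axis maximum.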

It’s still hard to say whether this trend is good or bad. It is obvious that this class hardly gets changed any more, but that can still mean that certain problems exist, e.g. the class got too big and hardly any modification can be justified by a business reason. To nail down the cause, you’d need other metrics to give you more context. I’ve already been thinking of possible extensions that could exploit more sophisticated visualizations, for example one where you could see how the size of a class changes on the same timeline. In theory it would be awesome to see coverage there as well, but this is unfortunately not possible in our case. If you can do it, however, go for it, I’d be thrilled to see that. And if I were to choose how to visualize it, I’d probably go for a chart à la GapMinder, which more or less out of the box would give a combined view of metrics for many classes at the same time. Anyhow, I’m not sure which information would benefit you the most, but it’s very much worth exploring :)

I was not planning to explore this problem in any orderly fashion, because when I research a topic I like to jump from one thing to another, which helps me get a better grasp of all aspects of the particular problem. I decided that as a next step I wanted to go deeper into the correlations of classes across issues (I again got inspired by Michael Feathers). The first visualization that came out of this concept was a graph of all possible correlations (i.e. classes that get changed together as part of the same issue) above a certain threshold for the whole project, and it looked like this:

The size of a node (representing a class) is proportional to the aggregate number of issues in which the class was changed together with another class, and a correlation is depicted as a link between the two. This visualization certainly looks cool and is also interactive: you can pan the area, zoom, move nodes around… actually, look for yourself on the Protovis site. What’s the downside then? There’s just too much information. It does give you an overview of areas with strong coupling (see yellow) and it highlights the boundaries of application modules (see blue), but it’s almost impossible to get more specific information out of it. So it’s good as a start, but you need a next step here, something that lets you dig into the details of why the situation is as it is and whether you should do something about it.
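The data behind such a graph can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code that produced the picture: the input mapping from issue id to changed classes, the function names and the sample data are hypothetical, and computing node size as the sum of a class's link weights is only one way to read "aggregate number of issues".

```python
from collections import Counter
from itertools import combinations


def co_change_counts(issue_classes):
    """Count, for every pair of classes, the issues that changed both.

    `issue_classes` is a hypothetical input: issue id -> changed class names.
    """
    pairs = Counter()
    for classes in issue_classes.values():
        # sorted(set(...)) gives each unordered pair one canonical key
        for a, b in combinations(sorted(set(classes)), 2):
            pairs[(a, b)] += 1
    return pairs


def build_graph(issue_classes, threshold):
    """Keep links with at least `threshold` shared issues and size the nodes.

    Node size is the sum of the weights of a class's remaining links
    (an assumption about how the aggregate is computed).
    """
    links = {
        pair: n
        for pair, n in co_change_counts(issue_classes).items()
        if n >= threshold
    }
    node_size = Counter()
    for (a, b), n in links.items():
        node_size[a] += n
        node_size[b] += n
    return links, dict(node_size)


# Made-up sample data, for illustration only.
issue_classes = {
    "ISSUE-1": ["a.Order", "a.Invoice", "b.Mailer"],
    "ISSUE-2": ["a.Order", "a.Invoice"],
    "ISSUE-3": ["a.Order", "a.Invoice", "b.Mailer"],
    "ISSUE-4": ["b.Mailer"],
}

links, node_size = build_graph(issue_classes, threshold=3)
```

The `links` and `node_size` dictionaries map directly onto the links and nodes of a force-directed layout. Raising `threshold` is the simplest way to cut down the "too much information" problem, at the cost of hiding weaker couplings.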

I’m planning to describe ways of resolving this in Part 3, so for now let me just show you another way of visualizing the same information. What I’m going to present is IMHO much more useful when you need to focus on the correlations (and especially to identify where they don’t make any sense) rather than on the classes (correlations lower than 3 were filtered out):

The concept is quite similar to the previous one: nodes are classes, links are correlations. The classes are then positioned around the circle in a specific sort order, by package name. Having them in this order lets you apply a simple heuristic: whenever there’s a link between two remote locations on the circle, there is potentially unnecessary coupling between two separate packages. While there may be a legitimate relationship in the code, it’s at least suspicious if these classes get changed together too often (change frequency is represented by color, increasing from green to red). On the other hand, even if the correlated classes are close together (in the same package) but there are lots of links, it can still be a bad sign, for example that the package is too large. I didn’t play much with this visualization, so there may be many other ways of analyzing it and getting valuable information out of it. Moreover, there’s an amazing tool for doing much more powerful visualizations of this kind, and as soon as I learn how to use it, I’ll write more on its potential.
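The heuristic can also be applied to the data directly, without the picture. The sketch below is a simplified illustration with hypothetical names and sample data: it assumes fully qualified, dot-separated class names, treats any link between different packages as "remote" (the drawing itself shows distance more gradually), and uses a plain linear green-to-red blend, which may differ from the actual color scale of the chart.

```python
def package_of(qualified_name):
    """Package part of a dotted class name, e.g. 'a.b.Order' -> 'a.b'."""
    return qualified_name.rpartition(".")[0]


def circle_order(class_names):
    """Order classes around the circle: by package name, then class name."""
    return sorted(class_names, key=lambda name: (package_of(name), name))


def cross_package_links(links):
    """Links whose two ends live in different packages (the suspicious ones)."""
    return {
        pair: n
        for pair, n in links.items()
        if package_of(pair[0]) != package_of(pair[1])
    }


def link_color(count, max_count):
    """Linear blend from green (rare) to red (frequent) as an (r, g, b) tuple."""
    t = count / max_count if max_count else 0.0
    return (round(255 * t), round(255 * (1 - t)), 0)


# Made-up sample data: (class, class) -> number of shared issues.
links = {
    ("a.Invoice", "a.Order"): 3,
    ("a.Order", "b.Mailer"): 4,
}

order = circle_order(["b.Mailer", "a.Order", "a.Invoice"])
suspicious = cross_package_links(links)
```

Sorting `suspicious` by count, highest first, gives a short list of cross-package pairs worth a closer look before digging into the code.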

In the next part… right, I’m not gonna lie to you, I have absolutely no clue about the next part, besides that there’s going to be one. Maybe I’ll go into more detail on how I decided to present information for a single class… or maybe I’ll describe possible use cases you can employ these charts for… or something totally different. Not sure, so stay tuned.

1.  If the length and scope of your issues vary widely, this metric will count all changes within a single issue as one, which overestimates the importance of “quick fixes” and underestimates “long enhancements”. Because of that, I recently modified the metric so that it no longer counts all modifications in a single issue as one, but instead counts them on a per-day basis. So if a file is modified many times on different days, the number of days on which it was modified is the number we’re looking for.
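A minimal sketch of this per-day variant, under the same made-up input format of (issue id, class name, change date) tuples. The function name and sample data are hypothetical. The sketch counts distinct (issue, day) pairs, which is one reading of the footnote; if you would rather count plain calendar days regardless of issue, drop the issue id from the set entries.

```python
from collections import defaultdict
from datetime import date


def change_days_per_month(changes):
    """Count, per class and month, the distinct (issue, day) pairs with a change.

    Several modifications of a class on the same day within one issue
    count once; the same issue touching the class on another day counts again.
    """
    seen = defaultdict(set)  # (class_name, (year, month)) -> {(issue, day)}
    for issue_id, class_name, changed_on in changes:
        key = (class_name, (changed_on.year, changed_on.month))
        seen[key].add((issue_id, changed_on))

    counts = defaultdict(dict)  # class_name -> {(year, month): count}
    for (class_name, month), issue_days in seen.items():
        counts[class_name][month] = len(issue_days)
    return dict(counts)


# Made-up sample data, for illustration only.
changes = [
    ("ISSUE-1", "OrderService", date(2011, 9, 3)),
    ("ISSUE-1", "OrderService", date(2011, 9, 3)),  # same day: not counted again
    ("ISSUE-1", "OrderService", date(2011, 9, 4)),  # another day: counted
    ("ISSUE-2", "OrderService", date(2011, 9, 4)),
]

per_day = change_days_per_month(changes)
```

With the per-issue metric the sample above would give 2 for September (two issues). The per-day variant gives 3, because ISSUE-1 spans two days.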
