IT Realms: 2011

Monday, November 28, 2011

A serious take on Data-Rich Development - Part 2

In the last part I explained how I prepared the histogram of class changes across issues. That was certainly not everything you can do with the data I generated. The next step was to get more detailed information on each class. I decided that I definitely wanted to see how the number of modifications changed over time, and to achieve this I prepared the chart you can see below; it shows how many issues “changed” a given class at a specific moment in time (with a month as the resolution)¹:

Note that in this case I decided to use a standard line chart, since the information I wanted to present is not really that complex – I certainly don’t want the “cool charts” to become my golden hammer. It is also worth noting that after using this data for some time I decided to normalize the Y-axis, so that I can compare the metric across different classes at a glance. See for instance the second example below, where you can clearly see that over time the number of both modifications and defects decreased – the trend is not marked as sharply as on the stock market, but once you have seen a number of these charts (the one above being an example), the existence of a trend becomes obvious.
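To make the idea behind the chart a bit more concrete, here is a minimal Python sketch of how the monthly numbers could be computed. It is not the code I actually used; the sample rows, the function names and the reading of “normalize the Y-axis” as “use one common upper limit for every class” are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows in the (issue, check-in date, filename) shape.
CHANGES = [
    ("ISSUE-1", date(2011, 1, 10), "Booking.java"),
    ("ISSUE-1", date(2011, 1, 12), "Booking.java"),
    ("ISSUE-2", date(2011, 1, 20), "Booking.java"),
    ("ISSUE-3", date(2011, 2, 3), "Booking.java"),
    ("ISSUE-3", date(2011, 2, 3), "Fare.java"),
]


def issues_per_month(changes):
    """Map filename -> {(year, month): number of distinct issues}."""
    seen = defaultdict(lambda: defaultdict(set))
    for issue, day, filename in changes:
        # A set makes sure an issue is counted once per file and month.
        seen[filename][(day.year, day.month)].add(issue)
    return {
        filename: {month: len(issues) for month, issues in months.items()}
        for filename, months in seen.items()
    }


def shared_y_max(series):
    """Largest monthly count over all classes (assumed common Y-axis limit)."""
    return max(count for months in series.values() for count in months.values())


series = issues_per_month(CHANGES)
y_max = shared_y_max(series)
```

Plotting every class against the same `y_max` is what lets you compare two charts at a glance; any line chart library would do for the drawing part.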

It’s still hard to say whether this trend is good or bad – it is obvious that this class gets changed less and less, but that can still mean that certain problems exist, e.g. the class got too big and hardly any modification can be justified by a business reason. To nail down the cause, you’d need other metrics to give you more context. I’ve already been thinking of possible extensions that could exploit more sophisticated visualizations, where for example you would be able to see how the size of a class changes on the same timeline. Theoretically it would be awesome to see coverage there as well, but this is unfortunately not possible in our case – if you can do it, however, go for it; I’d be thrilled to see that. And if I were to choose how to visualize that, I’d probably go for a chart à la GapMinder, which more or less out of the box would give a combined view of metrics for many classes at the same time. Anyhow, I’m not sure which information would benefit you the most, but it’s very much worth exploring :)

I was not planning to explore this problem in any orderly fashion, because when I research a topic I like to jump from one thing to another, which helps me get a better grasp of all aspects of the particular problem. I decided that as a next step I wanted to go deeper into the correlations of classes across issues (again I got inspired by Michael Feathers – http://michaelfeathers.typepad.com/michael_feathers_blog/2011/09/temporal-correlation-of-class-changes.html). The first visualization that came out of this concept was a graph of all correlations (i.e. classes that get changed together as part of the same issue) above a certain threshold for the whole project, and it looked like this:

The size of a node (representing a class) is proportional to the aggregate number of issues in which the class was changed together with another class, and a correlation is depicted as a link between the two. This visualization certainly looks cool and is also interactive – you can pan the area, zoom, move nodes around… actually, look for yourself on the Protovis site. What’s the downside then? There’s just too much information – it does show you an overview of areas with strong coupling (see yellow), and it will highlight the boundaries of application modules (see blue), but it’s almost impossible to get more specific information out of it. So it’s good as a start, but you need a next step here, something that would let you dig into the details of why the situation is as it is and whether you should do something about it.
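The data behind such a graph can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample rows and the function name are made up, a “correlation” is taken to be the number of issues in which two classes were changed together, and the node size is the sum of the counts on the links that survived the threshold.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical sample rows: (issue, check-in date, filename).
CHANGES = [
    ("ISSUE-1", "2011-01-10", "A.java"),
    ("ISSUE-1", "2011-01-10", "B.java"),
    ("ISSUE-1", "2011-01-11", "C.java"),
    ("ISSUE-2", "2011-02-01", "A.java"),
    ("ISSUE-2", "2011-02-01", "B.java"),
    ("ISSUE-3", "2011-02-15", "A.java"),
    ("ISSUE-3", "2011-02-15", "C.java"),
    ("ISSUE-4", "2011-03-02", "B.java"),
    ("ISSUE-4", "2011-03-02", "D.java"),
]


def co_change_graph(changes, threshold):
    """Return (links, node_sizes) for pairs changed together often enough."""
    files_by_issue = defaultdict(set)
    for issue, _day, filename in changes:
        files_by_issue[issue].add(filename)

    # Count, for every pair of files, the issues that touched both of them.
    pair_counts = Counter()
    for files in files_by_issue.values():
        for pair in combinations(sorted(files), 2):
            pair_counts[pair] += 1

    # Keep only links at or above the threshold.
    links = {pair: n for pair, n in pair_counts.items() if n >= threshold}

    # Node size: aggregate count over the links a file takes part in.
    node_sizes = Counter()
    for (first, second), n in links.items():
        node_sizes[first] += n
        node_sizes[second] += n
    return links, dict(node_sizes)


links, node_sizes = co_change_graph(CHANGES, threshold=2)
```

With the sample rows only the A–B and A–C links survive, and `D.java` drops out of the picture completely, which is exactly the kind of noise reduction the threshold is there for.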

I’m planning on describing ways of resolving this in Part 3 so for now let me just show you another way of visualizing the same information. What I’m going to present is IMHO much more useful when you need to focus on the correlations (especially identify where they don’t make any sense) rather than classes (correlations lower than 3 were filtered out):

The concept is quite similar to the previous one: nodes are classes, links are correlations. This time, classes are positioned around the circle in a specific sort order – by package name. Having them in such an order lets you apply a simple heuristic – whenever there’s a link between two remote locations on the circle, there is potentially unnecessary coupling between two separate packages… and while there may be a legitimate relationship in the code, it’s at least suspicious if these classes get changed together too often (change frequency is represented by color, increasing from green to red). On the other hand, even if the correlations are close (in the same package) but there are lots of them, it can still have a negative meaning – for example the package may be too large. I didn’t play much with this visualization, so there may be many other ways of analyzing it and getting valuable information out of it. Moreover, there’s an amazing tool for doing much more powerful visualizations of this kind, and as soon as I learn how to use it, I’ll write more on its potential.
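The heuristic itself does not need a chart at all. Here is a rough Python sketch of it, assuming fully qualified class names and treating any link between two different packages as “remote”; the class names, the counts and the function names are hypothetical.

```python
# Hypothetical links: (class, class) -> number of issues changing both.
LINKS = {
    ("app.booking.Booking", "app.booking.Seat"): 5,
    ("app.booking.Booking", "app.pricing.Fare"): 4,
    ("app.booking.Seat", "app.reporting.Report"): 7,
}


def package_of(class_name):
    """Everything before the last dot of a fully qualified class name."""
    return class_name.rpartition(".")[0]


def circle_order(class_names):
    """Order classes by package first, then by name, as around the circle."""
    return sorted(class_names, key=lambda name: (package_of(name), name))


def suspicious_links(links):
    """Links between different packages, most frequent first."""
    cross = [
        (pair, count)
        for pair, count in links.items()
        if package_of(pair[0]) != package_of(pair[1])
    ]
    return sorted(cross, key=lambda item: item[1], reverse=True)


classes = {name for pair in LINKS for name in pair}
ordered = circle_order(classes)
flagged = suspicious_links(LINKS)
```

The `flagged` list is just a starting point for a conversation: a cross-package link may well be justified, it simply deserves a second look when its count is high.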

In the next part… right, I’m not gonna lie to you, I have absolutely no clue about the next part, besides that there’s going to be one. Maybe I’m going to go into more detail on how I decided to present information for a single class… or maybe I’ll describe possible use cases you can employ these charts for… or something totally different. Not sure – stay tuned.

1. If the length and scope of your issues vary widely, this metric will count all changes within a single issue as one, which results in overestimating the importance of “quick fixes” and underestimating “long enhancements”. Because of that I recently modified the metric so that it no longer counts all modifications in a single issue as one, but instead counts them on a per-day basis. So if a file is modified many times on different days, the number of days on which it was modified is the number we’re looking for.
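The difference between the two ways of counting is easy to show on a toy example. The Python sketch below is an illustration only, with made-up rows and function names: one long enhancement touches a file on three different days, one quick fix touches it once.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: ISSUE-1 is a long enhancement, ISSUE-2 a quick fix.
CHANGES = [
    ("ISSUE-1", date(2011, 1, 10), "Booking.java"),
    ("ISSUE-1", date(2011, 1, 10), "Booking.java"),  # second check-in, same day
    ("ISSUE-1", date(2011, 1, 11), "Booking.java"),
    ("ISSUE-1", date(2011, 1, 12), "Booking.java"),
    ("ISSUE-2", date(2011, 1, 20), "Booking.java"),
]


def monthly_counts(changes, key):
    """Count distinct key(issue, day) values per file and month."""
    seen = defaultdict(lambda: defaultdict(set))
    for issue, day, filename in changes:
        seen[filename][(day.year, day.month)].add(key(issue, day))
    return {
        filename: {month: len(values) for month, values in months.items()}
        for filename, months in seen.items()
    }


# Original metric: every issue counts once, however long it took.
per_issue = monthly_counts(CHANGES, key=lambda issue, day: issue)

# Modified metric: every day with at least one check-in counts once.
per_day = monthly_counts(CHANGES, key=lambda issue, day: day)
```

For January the per-issue count is 2, while the per-day count is 4, so the long enhancement finally weighs more than the quick fix.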

Wednesday, November 09, 2011

A serious take on Data-Rich Development - Part 1

“It's about taking the data that we have at hand in our development work and really using it. If we are making a decision about whether to refactor a piece of code, we should be able to see its churn and complexity trends, and tie them back to events that happened over time, and the actual features which triggered the work. Right now, it seems that we often look at our decisions through the pinhole of the present, ignoring what we can learn from our code's past.”

Michael Feathers, Data-Rich Development

A few weeks ago I had a very fruitful conversation with my colleague Bogdan Lachendro from the team at Sabre Poland. Yet again we started discussing Technical Debt, and I can assure you it is not an easy topic if you consider a project of our magnitude (15 years of development, millions of lines of code). The crux is that you want the best possible way to identify the code spots you need to fix, as your time is limited and the number of potential areas to clean up is almost infinite.

What to do... what to do…

You might think – boy-scout rule!... and yeah, sure, that’ll help, but on a humongous codebase it has a nasty habit of working rather like a shotgun – sometimes you fix the code that really needed it, and most of the time you don’t… continuing with the metaphor, if you shoot long enough you might get lucky and hit the target more often than not. So it’s fine to do it, and IMHO it helps to keep the entropy from increasing, but it’s not really as helpful as you would want it to be.

At this point I can hear you screaming – code metrics bro’, code metrics! PMD, Sonar, FindBugs! Ok, cool – they’re fun to use, but if you’ve ever worked on a project of that size you know how that ends up. You get sooo many warnings that you don’t even know where to start. Again – cool stuff, but without extra cues, it won’t fly.

What we really needed was a method that’d tell us how we can get the best bang for the buck. We needed something that’d help us deliver faster and more easily, and have fewer bugs in the software (preferably in the places that customers use most often). Yes, I do mean all of it when I’m saying “reduce technical debt”, for the very reason that technical debt is not really as “technical” as the name would suggest. At this point we already knew that we needed a whole truck of fortune cookies to tell us what that freakin’ business would want 10 years from now and where our clients would find most bugs, so we could go ahead and clean it up… or, maybe, just maybe we could extrapolate from historical data. After all, we are able to extract data for over 6 years of development, and there should be enough information to draw meaningful conclusions from it.

And what we came up with was to take the information from the issue tracking system¹ (we use ClearQuest) and the source code (we used ClearCase and are currently on SVN) and consolidate it. That way we would know which areas of the code were changed the most and could predict that they might be the main targets of modification in upcoming months (I’m using months, ‘cause I’m not really sure if much more time is left – because of that thing with Mayans and their calendar...). Sounds familiar, right? Yeah, I won’t lie to you; we were under the strong influence of what Michael Feathers has been preaching for some time now (e.g. Getting Empirical about Refactoring).

The hacking started, and after not that long a time I managed to get my data out of ClearQuest (thank you CQPerl!) and combine both SVN and ClearCase information to get a list of tuples in this format: [ISSUE_NUMBER, CHECKIN_DATE, FILENAME]. The next step was to visualize it somehow... actually I was pretty skeptical about using yet another line/bar/pie chart and craved a picture whose meaning just gets to you. In fact I strongly believe that in today’s world it is more important than ever that you not only pick the right information from the data, but also choose the right way to present it, and with the right tools this can actually be pretty easy to achieve. After a moment of googling I came across Protovis – a really amazing visualization library that lets you do very powerful graphics almost out of the box. And after no more than 15 minutes the graph below was brought into the world:

Histogram depicting the number of defect fixes in which classes were changed (bubbles represent files; the size of a bubble is directly proportional to the number of defect fixes the file was modified in)

I was not entirely honest with you… A few more little details about what I did:

filtered out files that do not appear in more than N issues,
made the size of a bubble a quadratic function of the number of issues that a file was modified in (i.e. size_of_bubble = number_of_issues ^2),
grouped the colors of bubbles by package (or this might have been part of the code from the original example, I don’t remember)

I don’t want to share too many details about my project, but trust me, these are really the places in the code that you would expect to appear on this graph (and now we have data to defend our position, which is pretty cool).
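The first two steps from the list above can be sketched in Python like this. It is a minimal illustration and not the original script: it assumes the [ISSUE_NUMBER, CHECKIN_DATE, FILENAME] tuples are already in memory, and the sample rows, the function name and the `min_issues` parameter (playing the role of N) are made up.

```python
from collections import defaultdict

# Hypothetical tuples in the [ISSUE_NUMBER, CHECKIN_DATE, FILENAME] format.
CHANGES = [
    ("ISSUE-1", "2011-01-10", "Booking.java"),
    ("ISSUE-2", "2011-01-15", "Booking.java"),
    ("ISSUE-3", "2011-02-01", "Booking.java"),
    ("ISSUE-3", "2011-02-02", "Booking.java"),  # same issue, counted once
    ("ISSUE-1", "2011-01-10", "Fare.java"),
    ("ISSUE-2", "2011-01-15", "Seat.java"),
    ("ISSUE-4", "2011-03-01", "Seat.java"),
]


def bubble_sizes(changes, min_issues):
    """Map filename -> bubble size for files in more than min_issues issues."""
    issues_by_file = defaultdict(set)
    for issue, _checkin_date, filename in changes:
        issues_by_file[filename].add(issue)

    sizes = {}
    for filename, issues in issues_by_file.items():
        number_of_issues = len(issues)
        if number_of_issues > min_issues:
            # Quadratic scaling makes the frequently changed files stand out.
            sizes[filename] = number_of_issues ** 2
    return sizes


sizes = bubble_sizes(CHANGES, min_issues=1)
```

With the sample rows `Fare.java` is filtered out, and the gap between `Booking.java` (3 issues, size 9) and `Seat.java` (2 issues, size 4) is visibly bigger than a linear scale would make it.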

I myself had two most obvious conclusions after looking at this “rich histogram”:

There are around 15-20 files that stand out (if you consider that we have tens of thousands of files in a VCS, this is a pretty strong signal) – you would probably start your technical debt discussion from these classes, i.e. make them smaller, clean up the implementation, increase coverage,
files in the “orange” package are changed more often than others – this needs more analysis of whether the package is simply that large, or whether its files are often changed together… and then maybe because of coupling

What can you do next with this graph? The possibilities are plenty… things that I have already done, or am planning to do in the near future, are to let the user:

see the histogram at either class level or package level
filter out specific modules or packages
filter out test code
calculate the histogram only for a selected date range
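All of these options boil down to massaging the list of tuples before the histogram is calculated. Below is a rough Python sketch of how that could look; the paths, the function names and especially the rule for recognizing test code (a directory called `test` somewhere in the path) are assumptions for illustration, not how our project is actually laid out.

```python
import posixpath
from datetime import date

# Hypothetical rows: (issue, check-in date, path of the changed file).
CHANGES = [
    ("ISSUE-1", date(2010, 12, 30), "src/main/app/booking/Booking.java"),
    ("ISSUE-2", date(2011, 1, 15), "src/main/app/booking/Seat.java"),
    ("ISSUE-2", date(2011, 1, 15), "src/test/app/booking/SeatTest.java"),
    ("ISSUE-3", date(2011, 2, 1), "src/main/app/legacy/Old.java"),
]


def is_test_file(path):
    # Assumption: test sources live under a directory named "test".
    return "test" in path.split("/")


def select(changes, start=None, end=None, skip_tests=False, skip_prefixes=()):
    """Keep only rows inside the date range and outside the excluded areas."""
    kept = []
    for issue, day, path in changes:
        if start is not None and day < start:
            continue
        if end is not None and day > end:
            continue
        if skip_tests and is_test_file(path):
            continue
        if any(path.startswith(prefix) for prefix in skip_prefixes):
            continue
        kept.append((issue, day, path))
    return kept


def to_package_level(changes):
    """Replace every file path with its directory, i.e. its package."""
    return [(issue, day, posixpath.dirname(path)) for issue, day, path in changes]


selected = select(
    CHANGES,
    start=date(2011, 1, 1),
    end=date(2011, 12, 31),
    skip_tests=True,
    skip_prefixes=("src/main/app/legacy/",),
)
by_package = to_package_level(selected)
```

Since the output has the same shape as the input, the filtered list can be fed straight into whatever calculates the histogram.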

Now, this graph alone does not yet tell you whether these are the files that need your care, but it is one step forward from being completely ignorant about the historical context of changes in your codebase and its potential impact on the business in the future.

In the next posts I will continue to explore various ways you can use this information to exploit your data to the limits.

1. We might also have considered pulling the information from the VCS alone, but we used ClearCase for a long time and only recently switched to Subversion – and since ClearCase doesn’t really have a concept of a commit, that was a blocker.