This blog is no longer being maintained, please go to The missing link of Agile instead

Wednesday, November 09, 2011

A serious take on Data-Rich Development - Part 1

“It's about taking the data that we have at hand in our development work and really using it.  If we are making a decision about whether to refactor a piece of code, we should be able to see its churn and complexity trends, and tie them back to events that happened over time, and the actual features which triggered the work.  Right now, it seems that we often look at our decisions through the pinhole of the present, ignoring what we can learn from our code's past.”
Michael Feathers, Data-Rich Development

A few weeks ago I had a very fruitful conversation with my colleague Bogdan Lachendro from the team at Sabre Poland. Yet again we started discussing Technical Debt, and I can assure you it is not an easy topic on a project of our magnitude (15 years of development, millions of lines of code). The crux is that you want the best possible way to identify the code spots you need to fix, as your time is limited and the potential areas to clean up are almost infinite.

What to do... what to do…

You might think – boy-scout rule!... and yeah, sure, that’ll help, but on a humongous codebase it has a nasty habit of working rather like a shotgun – sometimes you fix the code that really needed it, but most of the time you don’t… continuing with the metaphor, if you shoot long enough you might get lucky and hit the target more often than not. So it’s fine to do it, and IMHO it helps keep the entropy from increasing, but it’s not nearly as helpful as you would want it to be.

At this point I can hear you screaming – code metrics bro’, code metrics! PMD, Sonar, FindBugs! Ok, cool – they’re fun to use, but if you’ve ever worked on a project of that size you know how that ends up. You get sooo many warnings that you don’t even know where to start. Again – cool stuff, but without extra cues, it won’t fly.

What we really needed was a method that would tell us how to get the best bang for the buck. We needed something that would help us deliver faster and easier, and have fewer bugs in the software (preferably in the places that customers use most often). Yes, I do mean all of that when I say “reduce technical debt”, for the very reason that technical debt is not really so much “technical” as the name would suggest. At this point we already knew that we needed a whole truck of fortune cookies to tell us what that freakin’ business would want 10 years from now and where our clients would find the most bugs, so we could go ahead and clean it up… or, maybe, just maybe, we could extrapolate from historical data. After all, we are able to extract data from over 6 years of development, and there should be enough information in it to draw meaningful conclusions.

And what we came up with was to take the information from the issue tracking system1 (we use ClearQuest) and the source code (we used ClearCase and are currently on SVN) and consolidate it. That way we would know which areas of the code were changed the most and could predict that they might be the main targets of modification in the upcoming months (I’m using months, ‘cause I’m not really sure how much more time we have left – because of that thing with the Mayans and their calendar...). Sounds familiar, right? Yeah, I won’t lie to you; we were under the strong influence of what Michael Feathers has been preaching for some time now (e.g. Getting Empirical about Refactoring).

The hacking started, and after not that long a time I managed to get my data out of ClearQuest (thank you, CQPerl!) and combine both the SVN and ClearCase information to get a list of tuples in this format: [ISSUE_NUMBER, CHECKIN_DATE, FILENAME]. The next step was to visualize it somehow... actually, I was pretty skeptical about using yet another line/bar/pie chart and craved a picture whose meaning just gets to you. In fact, I strongly believe that in today’s world it is more important than ever that you not only pick the right information out of the data, but also choose the right way to present it – and with the right tools this can actually be pretty easy to achieve. After a moment of googling I came upon Protovis – a really amazing visualization library that lets you do very powerful graphics almost out of the box. And after not more than 15 minutes, the graph below was brought into the world:
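To give an idea of what that consolidation can look like on the SVN side, here is a minimal Python sketch. Everything in it – the CQ\d{8} issue-number pattern, the sample log entry, the file names – is made up for illustration; the real extraction also involved CQPerl and the ClearCase history:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample of `svn log --xml --verbose` output; in reality
# this would be dumped straight from the repository.
SVN_LOG_XML = """<?xml version="1.0"?>
<log>
  <logentry revision="1042">
    <author>jdoe</author>
    <date>2011-10-03T14:21:00.000000Z</date>
    <paths>
      <path action="M">/trunk/src/booking/FareCalculator.java</path>
      <path action="M">/trunk/src/booking/TaxRules.java</path>
    </paths>
    <msg>CQ00123456: fix rounding error in fare calculation</msg>
  </logentry>
</log>"""

ISSUE_ID = re.compile(r"CQ\d{8}")  # assumed issue-number format

def extract_tuples(log_xml):
    """Yield (ISSUE_NUMBER, CHECKIN_DATE, FILENAME) for every changed
    file in every commit whose message references an issue."""
    root = ET.fromstring(log_xml)
    for entry in root.iter("logentry"):
        msg = entry.findtext("msg", default="")
        date = entry.findtext("date", default="")[:10]  # keep YYYY-MM-DD
        for issue in ISSUE_ID.findall(msg):
            for path in entry.iter("path"):
                yield (issue, date, path.text)

for row in extract_tuples(SVN_LOG_XML):
    print(row)
```

This assumes developers put the issue number in the commit message – which, as the footnote explains, is exactly the concept ClearCase lacked, hence the extra CQPerl plumbing.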

Histogram depicting the number of defect fixes in which each class was changed (bubbles represent files; the size of a bubble is directly proportional to the number of defect fixes the file was modified in)

I was not entirely honest with you… A few more little details about what I did:
  • filtered out files that do not appear in more than N issues,
  • made the size of a bubble a quadratic function of the number of issues a file was modified in (i.e. size_of_bubble = number_of_issues^2),
  • grouped the colors of bubbles by package (or that might have been part of the code from the original example, I don’t remember),
  • I don’t want to share too many details about my project, but trust me, these are really the places in the code you would expect to appear on this graph (and now we have data to defend our position, which is pretty cool).
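The first two tweaks above boil down to a few lines of counting code. A rough Python sketch, assuming the tuple format from earlier (the sample tuples and the threshold value are made up; the real graph was of course computed over the full ClearQuest/VCS data set):

```python
from collections import Counter

# (ISSUE_NUMBER, CHECKIN_DATE, FILENAME) tuples as produced by the
# consolidation step; file names here are purely illustrative.
tuples = [
    ("CQ00000001", "2011-01-10", "src/a/Foo.java"),
    ("CQ00000002", "2011-02-12", "src/a/Foo.java"),
    ("CQ00000003", "2011-03-15", "src/a/Foo.java"),
    ("CQ00000004", "2011-04-01", "src/b/Bar.java"),
]

N = 2  # drop files that do not appear in more than N issues

# Count *distinct* issues per file, not raw check-ins: the same issue
# may touch a file several times.
seen = {(issue, filename) for issue, _date, filename in tuples}
issues_per_file = Counter(filename for _issue, filename in seen)

bubbles = {
    filename: count ** 2           # quadratic bubble size
    for filename, count in issues_per_file.items()
    if count > N                   # threshold filter
}
print(bubbles)  # {'src/a/Foo.java': 9}
```

The resulting file → size mapping is what would then be handed to Protovis for rendering.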

I myself had two rather obvious conclusions after looking at this “rich histogram”:
  • There are around 15-20 files that stand out (considering we have tens of thousands of files in the VCS, this is pretty strong) – you should probably start your technical debt discussion from these classes, i.e. make them smaller, clean up the implementation, increase coverage,
  • files in the “orange” package are changed more often than the others – this needs more analysis: is it simply that the package is that large, or are the files often changed all together… and, if so, maybe because of coupling?

What can you do next with this graph? The possibilities are plenty… the things that I have already done, or am planning to do in the near future, are to let the user:
  • see the histogram either at the class level or at the package level,
  • filter out specific modules or packages,
  • filter out test code,
  • calculate the histogram only for a selected date range.
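All of these options are straightforward once you have the tuples. A hedged Python sketch of how such filters might compose (the file names, dates, and the Test.java naming convention are assumptions for illustration):

```python
from collections import Counter

# Same tuple format as before; the data is illustrative.
tuples = [
    ("CQ00000001", "2009-06-01", "src/booking/Foo.java"),
    ("CQ00000002", "2011-02-12", "src/booking/Foo.java"),
    ("CQ00000003", "2011-03-15", "src/pricing/Bar.java"),
    ("CQ00000004", "2011-04-01", "src/pricing/BarTest.java"),
]

def histogram(tuples, start=None, end=None, by_package=False, skip_tests=False):
    """Count distinct issues per file (or per package) with optional
    date-range and test-code filters. ISO dates compare lexicographically,
    so plain string comparison is enough here."""
    seen = set()
    for issue, date, filename in tuples:
        if start and date < start:
            continue
        if end and date > end:
            continue
        if skip_tests and filename.endswith("Test.java"):
            continue
        key = filename.rsplit("/", 1)[0] if by_package else filename
        seen.add((issue, key))
    return Counter(key for _issue, key in seen)

# Only 2011 onward, aggregated per package, without test code:
print(histogram(tuples, start="2011-01-01", by_package=True, skip_tests=True))
```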

Now, this graph alone does not yet tell you whether these are the files that need your care, but it is one step forward from being completely ignorant about the historical context of changes in your codebase and their potential impact on the business in the future.

In the next posts I will continue to explore various ways you can use this information to exploit your data to the limits.


1) We might also have considered pulling this information from the VCS alone, but we used ClearCase for a long time and only recently switched to Subversion – and since ClearCase doesn’t really have the concept of a commit, that was a blocker.
