“It's about taking the data that we have at hand in our development work and really using it. If we are making a decision about whether to refactor a piece of code, we should be able to see its churn and complexity trends, and tie them back to events that happened over time, and the actual features which triggered the work. Right now, it seems that we often look at our decisions through the pinhole of the present, ignoring what we can learn from our code's past.”
Michael Feathers, Data-Rich Development
A few weeks ago I had a very fruitful conversation with my colleague Bogdan Lachendro from the team at Sabre Poland. Yet again we started discussing technical debt, and I can assure you it is not an easy topic if you consider a project of our magnitude (15 years of development, millions of lines of code). The crux is that you want the best possible way to identify the code spots you need to fix, as your time is limited and the potential areas to clean up are almost infinite.
What to do... what to do…
You might think – boy-scout rule!... And yeah, sure, that'll help, but on a humongous codebase it has a nasty habit of working rather like a shotgun – sometimes you fix the code that really needed it, but most of the time you don't… Continuing the metaphor: if you shoot long enough, you might get lucky and hit the target more often than not. So it's fine to do it, and IMHO it helps to keep the entropy from increasing, but it's not really as helpful as you would want it to be.
At this point I can hear you screaming – code metrics, bro', code metrics! PMD, Sonar, FindBugs! OK, cool – they're fun to use, but if you've ever worked on a project of that size, you know how that ends up. You get sooo many warnings that you don't even know where to start. Again – cool stuff, but without extra cues, it won't fly.
What we really needed was a method that would tell us how to get the best bang for the buck. We needed something that would help us deliver faster and easier, with fewer bugs in the software (preferably in the places that customers use most often). Yes, I do mean all of it when I say "reduce technical debt", for the very reason that technical debt is not really as "technical" as the name would suggest. At this point we already knew that we needed a whole truck of fortune cookies to tell us what that freakin' business would want 10 years from now and where our clients would find the most bugs, so we could go ahead and clean it up… or, maybe, just maybe, we could extrapolate from historical data. After all, we can extract data from over 6 years of development, and there should be enough information in it to draw meaningful conclusions.
And what we came up with was to take the information from the issue tracking system1) (we use ClearQuest) and the source code (we used ClearCase and are currently on SVN) and consolidate it. That way we would know which areas of the code were changed the most and could predict that they might be the main targets of modification in the upcoming months (I'm using months, 'cause I'm not really sure how much time is left – because of that thing with the Mayans and their calendar...). Sounds familiar, right? Yeah, I won't lie to you; we were under the strong influence of what Michael Feathers has been preaching for some time now (e.g. Getting Empirical about Refactoring).
The hacking started, and after not that long a time I managed to get my data out of ClearQuest (thank you, CQPerl!) and combine both SVN and ClearCase information to get a list of tuples in this format: [ISSUE_NUMBER, CHECKIN_DATE, FILENAME]. The next step was to visualize it somehow... Actually, I was pretty skeptical about using yet another line/bar/pie chart and craved a picture whose meaning just gets to you. In fact, I strongly believe that in today's world it is more important than ever that you not only pick the right information from the data, but also choose the right way to present it, and with the right tools this can actually be pretty easy to achieve. After a moment of googling I came across Protovis – a really amazing visualization library that lets you do very powerful graphics almost out-of-the-box. And after not more than 15 minutes, the graph below was brought into the world:
Histogram depicting the number of defect fixes across which classes were changed (bubbles represent files; the size of a bubble is directly proportional to the number of defect fixes the file was modified in)
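The consolidation step above can be sketched roughly as follows. This is a minimal illustration, not our actual scripts: it assumes you have already parsed the VCS log into (message, date, changed_paths) entries (e.g. from `svn log -v --xml`), and that commit messages reference issue IDs in some recognizable pattern – the `CQ\d+` regex here is a made-up convention, so adjust it to your tracker.

```python
import re
from collections import Counter

# Hypothetical issue-ID pattern -- adjust to your tracker's convention.
ISSUE_RE = re.compile(r"\b(CQ\d{5,})\b")

def tuples_from_log(entries):
    """Turn parsed log entries into (issue, date, filename) tuples.

    `entries` is an iterable of (message, date, changed_paths) triples,
    e.g. produced by parsing `svn log -v --xml` output.
    """
    out = []
    for message, date, paths in entries:
        for issue in ISSUE_RE.findall(message):
            for path in paths:
                out.append((issue, date, path))
    return out

def fix_counts(tuples):
    """Number of distinct issues each file was modified in."""
    seen = {(issue, path) for issue, _, path in tuples}
    return Counter(path for _, path in seen)
```

From there, `fix_counts` gives exactly the per-file number that drives the bubble sizes in the graph.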
I was not entirely honest with you… A few more little details about what I did:
- filtered out files that do not appear in more than N issues,
- made the size of a bubble a quadratic function of the number of issues the file was modified in (i.e. size_of_bubble = number_of_issues^2),
- grouped the colors of bubbles by package (or it might have also been part of the code from an original example, I don't remember),
- I don't want to share too many details about my project, but trust me, these are really the places in the code that you would expect to appear on this graph (and now we have data to defend our position, which is pretty cool).
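The first two tweaks in that list fit in a few lines. A sketch, assuming you already have the per-file issue counts; the threshold is left as a parameter since the actual value of N was not given:

```python
def bubble_data(issue_counts, n_min):
    """Keep files modified in more than `n_min` issues and compute
    bubble sizes as a quadratic function of the issue count
    (size_of_bubble = number_of_issues^2)."""
    return {
        path: count ** 2
        for path, count in issue_counts.items()
        if count > n_min
    }
```

The quadratic sizing makes the heavy hitters visually dominate instead of blending into a sea of similar-sized bubbles.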
I myself had two most obvious conclusions after looking at this "rich histogram":
- There are around 15-20 files that stand out (if you consider that we have tens of thousands of files in the VCS, this is pretty strong) – you should probably start your technical debt discussion with these classes, i.e. make them smaller, clean up the implementation, increase coverage,
- files in the "orange" package are changed more often than others – this needs more analysis: is it just that the package is that large, or are the files often changed all together… and then maybe because of coupling?
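One way to follow up on that second question is to count how often pairs of files are touched by the same issue – pairs that share many issues are candidates for hidden coupling. A hedged sketch over the same (issue, date, filename) tuples:

```python
from collections import Counter
from itertools import combinations

def cochange_counts(tuples):
    """Count how often each pair of files is modified in the same issue.

    `tuples` are (issue, date, filename) records; a high count for a
    pair suggests the two files may be coupled.
    """
    files_by_issue = {}
    for issue, _, path in tuples:
        files_by_issue.setdefault(issue, set()).add(path)
    pairs = Counter()
    for files in files_by_issue.values():
        for a, b in combinations(sorted(files), 2):
            pairs[(a, b)] += 1
    return pairs
```

Sorting the resulting counter would surface the most suspicious pairs first.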
What else can you do with this graph? The possibilities are plenty… Things that I have already done, or am planning to do in the near future, are to let the user:
- see the histogram either at the class level or at the package level,
- filter out specific modules or packages,
- filter out test code,
- calculate the histogram only for a selected date range.
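All of those filters operate on the same tuple list before the counts are computed. A minimal sketch, with loudly naive placeholders: detecting test code by a `/test/` path segment and collapsing a file to its directory as the "package" are illustrative shortcuts, not how a real Java project would necessarily map to packages.

```python
from datetime import date

def filter_tuples(tuples, start=None, end=None,
                  exclude_test=True, package_level=False):
    """Filter (issue, day, filename) tuples by date range and test code,
    optionally collapsing files to their package (parent directory).

    Dates are `datetime.date`; the test-code check and package mapping
    are naive placeholders for illustration.
    """
    out = []
    for issue, day, path in tuples:
        if start and day < start:
            continue
        if end and day > end:
            continue
        if exclude_test and "/test/" in path:
            continue
        if package_level:
            path = path.rsplit("/", 1)[0]  # collapse file -> directory
        out.append((issue, day, path))
    return out
```

Feeding the filtered tuples back through the issue-counting step then yields the histogram for just that slice of history.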
Now, this graph alone does not yet tell you whether these are the files that need your care, but it is one step forward from being completely ignorant about the historical context of changes in your codebase and their potential impact on the business in the future.
In the next posts I will continue to explore various ways you can use this information to exploit your data to the limits.
1) We might also have considered pulling the information from the VCS alone, but we used ClearCase for a long time and only recently switched to Subversion – and since ClearCase doesn't really have a concept of a commit, that was a blocker.