Analyzing an SVN Log for Code Churn

For this activity we will write a Ruby script, churn.rb, that examines an svn repository log and identifies the parts of a project that have experienced the most activity, or “code churn”, over a given period of time.

When testing a product just prior to release, testing resources need to be allocated to make the best use of the finite amount of time before the release. One approach is to focus testing efforts on those parts of the project that have changed most frequently since the last release. We can gather this information by examining the svn history of each project directory and counting the number of file updates, or churn, generated each time a file was committed. There may be good reasons why a file was committed a number of times, but based on Murphy’s Law it is not unreasonable to conclude that every time you touch a file you have a good chance of screwing something up! Churn data is also useful for identifying opportunities for refactoring.

You can get the svn directory listing of a repository using ‘svn list’. Here is an example using an open source project, the Small Device C Compiler (sdcc): https://sdcc.svn.sourceforge.net/svnroot/sdcc/trunk/sdcc/src/

(Note: I have excluded some of the actual output for brevity)

>svn list https://sdcc.svn.sourceforge.net/svnroot/sdcc/trunk/sdcc/src/
avr/
ds390/
ds400/
c08/
izt/
mcs51/
pic14/
pic16/
port-clean.mk
port.h
port.mk
regression/
reswords.gperf
src.vcxproj
src.vcxproj.filters
version.awk
xa51/
yacc.vcxproj
z80/

You can view the history of any part of the project using the ‘svn log’ command. (See: http://svnbook.red-bean.com/en/1.0/ch03s06.html) Here is an example using the above z80/ directory (again excluding some of the actual output):

svn log --revision 'HEAD:{2010-07-30}' https://sdcc.svn.sourceforge.net/svnroot/sdcc/trunk/sdcc/src/z80
------------------------------------------------------------------------

r5948 | spth | 2010-08-26 13:23:32 -0400 (Thu, 26 Aug 2010) | 1 line
Fixed #3052514

------------------------------------------------------------------------

r5920 | MaartenBrock | 2010-08-08 16:41:35 -0400 (Sun, 08 Aug 2010) | 1 line
* src/z80/gen.c (genCmp): no PO flag on GBZ80

------------------------------------------------------------------------

r5919 | MaartenBrock | 2010-08-08 14:55:12 -0400 (Sun, 08 Aug 2010) | 3 lines
* src/SDCCval.c (valMinus): applied patch from bug 3037889 though not fixed, thanks Patrik Persson
* src/z80/gen.c (genCmp): fixed bug 3041519

------------------------------------------------------------------------

r5918 | MaartenBrock | 2010-08-08 08:53:43 -0400 (Sun, 08 Aug 2010) | 14 lines
* sim/ucsim/z80.src/inst.cc (inst_add): moved add_HL_Word to z80mac.h,
(inst_daa, inst_scf, inst_ccf, inst_sub): fixed flags
(inst_jp): fixed JP PO/PE
* sim/ucsim/z80.src/inst_ed.cc (inst_ed): fixed flags
* sim/ucsim/z80.src/inst_xd.cc (inst_Xd_add): moved add_IX_Word to z80mac.h
* sim/ucsim/z80.src/instcl.h: cosmetics
* sim/ucsim/z80.src/z80.cc (print_regs): print N flag
* sim/ucsim/z80.src/z80mac.h: fixed flags
* src/z80/gen.c (genIfxJump): simplified(genCmp): fixed bug 1757671
* device/include/stdbool.h: __SDCC_WEIRD_BOOL==2 for hc08/pic14/pic16
* support/regression/tests/bool.c: run test only once,run half the cases for __bit
* support/regression/tests/bug1757671.c: new, added


To start with, the script we are writing will be specific (hard coded) to one project and a selected level of directories. At a later time it could be extended to be more generic and find the subdirectories automatically. We could extend it even further to dive down to the file level and produce reports for the most active files in addition to directories, but we’ll save those features for a later version.

As a hint to start, do a manual svn list of the project you are analyzing and create an array of the lowest level directory names you will be looking in for code churn. SourceForge and RubyForge also provide repository browsers for navigating through the project. In the above example you would create an array similar to this:

subsystems = ['avr', 'ds390', 'xa51', 'z80']

You would then iterate through the svn repository by appending each directory name to the repository path. Note that you can execute any OS command in Ruby by enclosing it in backticks (same key as tilde ~), as shown here, and either capture the output of the command as a string or process it directly:

result = `svn log --revision 'HEAD:{2010-07-30}' https://sdcc.svn.sourceforge.net/svnroot/sdcc/trunk/sdcc/src/z80`
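One possible way to turn the log text into a churn number is to count the revision header lines (the ones that look like "r5948 | spth | ..."). The sketch below is only an illustration under that assumption: the helper names svn_log_command and count_revisions and the BASE_URL constant are hypothetical, and the sample text stands in for real svn output so the snippet runs without network access.

```ruby
# Hypothetical constant: the repository path from the example above.
BASE_URL = 'https://sdcc.svn.sourceforge.net/svnroot/sdcc/trunk/sdcc/src'

# Build the svn log command for one subdirectory and a start date (YYYY-MM-DD).
def svn_log_command(subsystem, since)
  "svn log --revision 'HEAD:{#{since}}' #{BASE_URL}/#{subsystem}"
end

# Count revisions by matching header lines such as
# "r5948 | spth | 2010-08-26 13:23:32 -0400 (Thu, 26 Aug 2010) | 1 line".
# Assumption: commit messages do not themselves start with "rNNN | ".
def count_revisions(log_text)
  log_text.lines.count { |line| line =~ /\Ar\d+ \| / }
end

# Stand-in for captured svn output, trimmed to two revisions.
sample = <<~LOG
  ------------------------------------------------------------------------
  r5948 | spth | 2010-08-26 13:23:32 -0400 (Thu, 26 Aug 2010) | 1 line

  Fixed #3052514
  ------------------------------------------------------------------------
  r5920 | MaartenBrock | 2010-08-08 16:41:35 -0400 (Sun, 08 Aug 2010) | 1 line

  * src/z80/gen.c (genCmp): no PO flag on GBZ80
  ------------------------------------------------------------------------
LOG

puts count_revisions(sample) # prints 2
```

In the real script, each entry of the subsystems array could be passed to svn_log_command, the resulting string run with backticks, and the captured output handed to count_revisions.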

The script should accept one parameter (command line argument) which is the starting timestamp from which to count repository modifications:

> ruby churn.rb 2010-01-02

Validate that the timestamp is correctly formatted and is not later than the current date/time. Display an error message and exit the script if the timestamp is invalid. Great opportunity for unit tests!
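As a rough sketch of the validation step, the function below assumes the timestamp is a plain YYYY-MM-DD date, as in the example command line; a date/time with hours and minutes would need a different format string. The name parse_start_date and its optional today parameter are hypothetical, and the parameter exists only so that unit tests can pin "now" to a fixed day.

```ruby
require 'date'

# Returns a Date when text is a well-formed YYYY-MM-DD date that is not
# later than `today`; returns nil otherwise.
# Assumption: the script accepts dates only, not times of day.
def parse_start_date(text, today = Date.today)
  return nil unless text.to_s.match?(/\A\d{4}-\d{2}-\d{2}\z/)

  date = Date.strptime(text, '%Y-%m-%d') # raises on impossible dates
  date > today ? nil : date
rescue ArgumentError
  nil
end

fixed_today = Date.new(2010, 12, 31)
puts parse_start_date('2010-01-02', fixed_today).inspect
puts parse_start_date('2010-02-30', fixed_today).inspect # impossible date
puts parse_start_date('2011-06-01', fixed_today).inspect # in the future
```

In churn.rb, a nil result for ARGV[0] could be handled with Kernel#abort, which prints a message to standard error and exits with a non-zero status.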

Output is in histogram format, with a row of “*” characters showing relative ranking in descending order, followed by the actual count “(nn)”, as shown in the example below for the sdcc project. Note these are not the actual counts for each directory; this is just a representation of the output.

$ ruby churn.rb 2010-12-01
Changes since 2010-12-01:
avr ********** (45)
ds390 *** (19)
xa51 *** (17)
z80 * (2)
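The assignment does not say how many stars a given count should earn, so the sketch below picks one rule as an assumption: the largest count gets ten stars and every other count is scaled proportionally and rounded up. The names histogram_lines and MAX_STARS are hypothetical, and because the scaling rule is a guess, the star counts it produces will not necessarily match the sample output above.

```ruby
# Assumed width of the longest bar; not specified by the assignment.
MAX_STARS = 10

# Takes a hash of directory name => change count and returns one formatted
# line per directory, sorted by count in descending order.
def histogram_lines(counts, max_stars = MAX_STARS)
  largest = counts.values.max.to_i
  sorted = counts.sort_by { |_name, count| -count }
  sorted.map do |name, count|
    # Scale relative to the largest count, rounding up so that any
    # non-zero count shows at least one star.
    stars = largest.zero? ? 0 : (count * max_stars.to_f / largest).ceil
    "#{name} #{'*' * stars} (#{count})"
  end
end

counts = { 'z80' => 2, 'avr' => 45, 'xa51' => 17, 'ds390' => 19 }
puts 'Changes since 2010-12-01:'
puts histogram_lines(counts)
```

Returning an array of strings rather than printing inside the function keeps the formatting logic easy to unit test.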

Submit your solution to the RubyChurn directory via Git.

Additional mining to try after you complete the above assignment:

What specific module names are modified the most?
What developer (svn author) has made the most changes?
Try this script on your SE361 project.
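As a starting point for the author question, the second field of each revision header line holds the svn author, so one option is to extract that field and tally it. This is a minimal sketch with a hypothetical function name, commits_by_author, run against stand-in text rather than real svn output.

```ruby
# Tally commits per author from svn log text by reading the second field of
# header lines such as "r5920 | MaartenBrock | 2010-08-08 ... | 1 line".
def commits_by_author(log_text)
  authors = log_text.scan(/^r\d+ \| ([^|]+?) \| /).flatten
  authors.each_with_object(Hash.new(0)) { |author, tally| tally[author] += 1 }
end

# Stand-in for captured svn output, with commit messages omitted.
sample = <<~LOG
  r5948 | spth | 2010-08-26 13:23:32 -0400 (Thu, 26 Aug 2010) | 1 line
  r5920 | MaartenBrock | 2010-08-08 16:41:35 -0400 (Sun, 08 Aug 2010) | 1 line
  r5919 | MaartenBrock | 2010-08-08 14:55:12 -0400 (Sun, 08 Aug 2010) | 3 lines
LOG

tally = commits_by_author(sample)
top_author, top_count = tally.max_by { |_author, count| count }
puts "#{top_author} (#{top_count})" # prints MaartenBrock (2)
```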