Software Archeology @ RIT

[ar·che·ol·o·gy] n. the study of people by way of their artifacts
Week 15 - Git Log Loader Enhancements

06 Dec 2013

We have come a long way since the project started. From the beginning we have been building and reinforcing a strong software foundation, investing more time up front to maximize not only the amount of data we collect but also its integrity, depth, and complexity. We achieve this through foundational enhancements to our modules and components. This blog post focuses on one such module, our Git Log Loader, and the enhancements we have integrated into it.

Git Log Loader - Speed

Up until this point our Git Log Loader has been implemented and used to collect data. We've run several tests on the loader, first against a small sample file with only a handful of git commits, then against a file containing every git commit created since the beginning of the Chromium project. What we discovered is that processing took far too long. The file containing all the git commits was 1 MB, and processing it on my local machine took a little over an hour and a half, which is impractical. To speed this up we first looked at optimizing our own code, but after making those optimizations we found the change in processing time was not significant enough. So our next stop was open source.
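To give a feel for the text-processing half of the loader's job, here is a minimal sketch of parsing raw `git log` output into commit records. The field names and the exact log format handled are illustrative assumptions, not the project's actual parser:

```ruby
# Sketch: parse plain `git log` output into an array of commit hashes.
# Handles the default format: a "commit <sha>" line, "Author:" and "Date:"
# headers, then message lines indented by four spaces.
def parse_git_log(text)
  commits = []
  current = nil
  text.each_line do |line|
    case line
    when /\Acommit ([0-9a-f]{40})/
      commits << current if current    # flush the previous commit
      current = { sha: $1, message: +"" }
    when /\AAuthor: (.*)$/
      current[:author] = $1.strip if current
    when /\ADate:\s+(.*)$/
      current[:date] = $1.strip if current
    when /\A {4}(.*)$/
      current[:message] << $1 << "\n" if current
    end
  end
  commits << current if current        # flush the final commit
  commits
end
```

With parsing done this way, the loader ends up holding thousands of in-memory records, which is exactly where the insertion bottleneck described above appears.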

We discovered a tool called activerecord-import, a Ruby gem built on top of the already powerful Object-Relational Mapping (ORM) framework ActiveRecord. The key to understanding the enhancement is how ActiveRecord was importing these commits: plain ActiveRecord only inserts a single entry at a time into our MySQL database, so every commit meant a separate INSERT statement. The activerecord-import gem allows batch importing, inserting many rows per statement. This brought the processing time down to double-digit minutes.
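As a rough sketch of the change, the gem exposes a bulk `Model.import(columns, rows)` call in place of one `create` per commit. The wrapper below, including its name and the batch size of 1,000, is our own illustration rather than the loader's actual code:

```ruby
# Sketch of batched importing in the style of the activerecord-import gem.
# Instead of one INSERT per commit, each call to `model.import` issues a
# single multi-row INSERT for the whole batch.
def import_in_batches(model, columns, rows, batch_size: 1000)
  rows.each_slice(batch_size) do |batch|
    # validate: false skips per-record ActiveRecord validations for speed;
    # integrity is checked separately (see the next section).
    model.import(columns, batch, validate: false)
  end
end
```

A 100,000-commit load then costs on the order of 100 INSERT statements rather than 100,000, which is where the drop from hours to minutes comes from.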

This enhancement to the Git Log Loader acted as a proof of concept for our remaining loaders as well as those yet to come.

Git Log Loader - Integrity

The next area of emphasis is the integrity of the data we pull in using the loader. The question we are asking ourselves here is, "Can we be certain that the data being collected is correctly represented and pulled into our model?" The initial answer to this question is no. The complete commit file contains many thousands of commits, and it is not certain that the text processing and the model importing are happening correctly. The fact of the matter is that improperly collecting even a single piece of information could lead to an inaccurate overall representation of the data and therefore falsify our statistical outcomes and results. To combat this uncertainty we are utilizing the integrity check framework created by our fellow research partner Alberto Rodriguez.

This framework allows us to write tests against the results of our loaders. These tests let us eliminate many of the uncertainties we may have. They will not eliminate 100% of the possible errors, but they get us as close as possible. One example of how the framework assisted us with the loader is verifying the correctness of the filepaths listed in the commits: we had to ensure that the filepath formats were correct and complete so that we could later use those paths to traverse the Chromium project source tree.
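The real checks live in the integrity check framework mentioned above; as a standalone illustration of the filepath idea (our own sketch, not the framework's API), a check might reject any path that could not possibly resolve inside the source tree:

```ruby
# Illustrative filepath sanity check (not the framework's actual API).
# Paths pulled from commits should be relative, non-empty, use forward
# slashes, and contain no ".." segments before we walk the source tree.
def plausible_repo_path?(path)
  return false if path.nil? || path.empty?
  return false if path.start_with?("/")          # must be relative to the repo root
  return false if path.include?("\\")            # git paths use forward slashes
  return false if path != path.strip             # stray whitespace suggests a parsing bug
  return false if path.split("/").include?("..") # must not escape the tree
  true
end
```

An integrity test built on a predicate like this can sweep every loaded commit and flag the handful of rows where the text processing went wrong, rather than letting a bad path surface much later during tree traversal.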

Goals

Looking forward, there are several key things we must get done. These tasks break down into three areas. The first is investigating and fixing bugs; so far we have encountered few bugs, and we do not want that to change now. The second is creating and verifying relationships between our models; through these relationships we can eventually obtain strong statistical results and information. The third is creating more integrity checks, via verifies, to ensure that we keep moving forward with confidence.
