Software Archeology @ RIT

[ar·che·ol·o·gy] n. the study of people by way of their artifacts
Week 12 - Loading the data

14 Nov 2013

We’ve made a lot of progress over the last few weeks. Danielle described some of our configuration problems in her last post, but we have since sorted them out. Our switch to PostgreSQL has worked out well, as has our decision to use Rails and ActiveRecord. Our recent work has been focused on a few key areas:

Scraper

The scraper is finally finished! The objective of the scraper is to gather all of the code review data available to the general public from the Chromium website. It delivers that data in the form of JSON files. We have already scraped about a thousand code reviews to use as test data.
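
As a rough illustration, the heart of the scraper is a fetch along these lines. The Rietveld-style /api/<issue> endpoint and the issue number here are assumptions for illustration; the real scraper also handles pagination, retries, and errors.

    require 'net/http'
    require 'json'

    # Fetch the public JSON for a single code review (endpoint shape and
    # issue number are illustrative).
    issue = 10854242
    uri   = URI("https://codereview.chromium.org/api/#{issue}?messages=true")
    data  = JSON.parse(Net::HTTP.get(uri))

    # Save it for the loaders to pick up later.
    File.write("review_#{issue}.json", JSON.pretty_generate(data))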

Loaders

One of our main goals these last few weeks has been creating “loaders”. We use loaders to take the JSON data collected by the scraper and process it into our database schema. So far we have loaders for:

  • Code Reviews - This is our main loader; a simplified sketch appears after this list. It walks the JSON pulled from the Chromium website by the scraper and populates a number of tables based on this data, including Developers, OWNERS, patch sets, comments, and messages.
  • Git log commit files - This loader takes in the commit files from the git log and puts the relevant data into our Commits table. Relevant data includes the author’s name and email, the bug the commit fixed, the people who reviewed the commit, and so on.
  • CVE numbers - CVE numbers are “common identifiers for publicly known information-security vulnerabilities in publicly released software packages”. Each security bug has a corresponding CVE number, which we record in the code reviews table. Not all code reviews are the result of such bugs, so not every code review will have a CVE number.
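
To give a flavor of what the code review loader does, here is a simplified sketch. The model names (CodeReview, PatchSet, Message) and JSON keys are assumptions for illustration; the real loader fills several more tables and guards against missing fields.

    require 'json'

    # Simplified sketch of the code review loader (model names and JSON
    # keys are illustrative).
    def load_code_review(path)
      json = JSON.parse(File.read(path))

      review = CodeReview.create!(
        issue:       json['issue'],
        subject:     json['subject'],
        description: json['description']
      )

      # Record each patch set attached to the review.
      Array(json['patchsets']).each do |ps|
        PatchSet.create!(code_review_id: review.id, patchset: ps)
      end

      # Record the message thread between author and reviewers.
      Array(json['messages']).each do |msg|
        Message.create!(
          code_review_id: review.id,
          sender:         msg['sender'],
          text:           msg['text'],
          date:           msg['date']
        )
      end
    end

Note that the sketch stores review ids as plain columns rather than relying on database-level foreign keys, which matches the schema decision we discuss in the testing section below.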

Testing

Data integrity testing is important, especially because we decided not to use foreign keys in our database. For this reason we need to be sure that all of our data stays consistent. One example is a test that checks the consistency of CVE numbers: any CVE number recorded in the code review table should also be recorded in the CVE table. The integrity tests verify this holds in every case and report back any abnormalities they catch in the data.
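
A minimal sketch of that CVE check, assuming illustrative model and column names, might look like this:

    require 'set'

    # One integrity check: every CVE number referenced from the code
    # reviews table must also appear in the CVE table. Model and column
    # names are illustrative.
    class CveConsistencyTest < ActiveSupport::TestCase
      test "every CVE number on a code review exists in the CVE table" do
        known   = Set.new(Cve.pluck(:cve_number))
        used    = CodeReview.where("cve_number IS NOT NULL").pluck(:cve_number)
        orphans = used.reject { |n| known.include?(n) }

        assert orphans.empty?,
               "CVE numbers missing from the CVE table: #{orphans.join(', ')}"
      end
    end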

Goals

This week we started a Goal/Question/Metric spreadsheet. You can read more about this strategy here, but basically it’s a way to take the main questions behind what you’re researching or developing and break them down into the specific metrics you need to collect in order to meet your goals. It’s a great way to come up with solid facts that back up the point of your paper. This has been working really well for us so far, and we will continue to update the document throughout the project.
