Software Archeology @ RIT

[ar·che·ol·o·gy] n. the study of people by way of their artifacts
Week 13 - Data Collection Complete

23 Nov 2013

Last Week

This past week we worked heavily on finishing the scraper, which collects data from the Chromium website. Other priorities included finishing the loaders and tests, which load data from the scraped JSON files into our database and then verify its consistency, respectively. More information on all of this can be found in Shannon’s post from last week.

This Week

This week has been extremely exciting! We finally had the opportunity to run the scraper and collect our data. The scraper finished this past Saturday night (11/16/2013), taking only about 2.5 days to complete because we were able to use concurrent connections. About 159,000 items were collected. Looking back at our earlier estimates, we had overestimated the size of the data we intended to collect: it is only about 5.2GB uncompressed, or 426MB compressed.
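For those curious why concurrency made such a difference: instead of fetching one page at a time, the scraper can keep several requests in flight at once, and since each request spends most of its time waiting on the network, even a small pool of workers gives a large speedup. The sketch below is not our actual scraper, just a minimal illustration of the idea in Ruby, with a hypothetical list of item IDs and an assumed JSON endpoint on the code review site:

    require 'net/http'
    require 'thread'

    # Hypothetical list of item IDs; the real scraper walks the review
    # site to discover what to fetch.
    queue = Queue.new
    (1..100).each { |id| queue << id }
    fetched = Queue.new

    # A fixed pool of worker threads pulls IDs off the queue and fetches
    # them concurrently.
    workers = 8.times.map do
      Thread.new do
        loop do
          id = begin
            queue.pop(true)   # non-blocking pop; raises when queue is empty
          rescue ThreadError
            break
          end
          uri = URI("https://codereview.chromium.org/api/#{id}")  # assumed endpoint
          fetched << [id, Net::HTTP.get(uri)]
        end
      end
    end
    workers.each(&:join)

    puts "Fetched #{fetched.size} items"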

Aside from the huge success of running the scraper, the team has also continued working on the loaders and tests. As mentioned in previous posts and in the “Last Week” section above, the loader takes the JSON files we scraped, parses them, and puts the data into our database. This past week we tested the loader on a set of about 1,000 files, and it took roughly 60 minutes to load that test data into the database. We will focus on optimizing this process so we can run it more efficiently on our recently collected data. Writing data tests has been going well, but there are many more tests to write, which will continue to be a focus.
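The loader itself is still evolving, but its core loop is simple: read each scraped JSON file, parse it, and insert the data as rows. Here is a rough sketch of that shape, with hypothetical field names and SQLite standing in for our actual database:

    require 'json'
    require 'sqlite3'   # SQLite stands in here for our real database

    db = SQLite3::Database.new('archeology.db')
    db.execute <<~SQL
      CREATE TABLE IF NOT EXISTS code_reviews (
        id      INTEGER PRIMARY KEY,
        subject TEXT,
        created TEXT
      )
    SQL

    # Loading thousands of files is slow if every insert commits on its
    # own; wrapping the whole load in one transaction is a common easy win.
    db.transaction do
      Dir.glob('data/*.json') do |path|
        record = JSON.parse(File.read(path))
        # Field names here are hypothetical; the real schema comes from
        # whatever the scraper wrote out.
        db.execute('INSERT OR REPLACE INTO code_reviews (id, subject, created) VALUES (?, ?, ?)',
                   [record['issue'], record['subject'], record['created']])
      end
    end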

Essentially, apart from collecting our new data, this week has consisted of (1) running our loader and tests on the test data to make sure everything works before we use it on the new data, and (2) refactoring code to be more efficient, using output statements and tools like the Ruby Benchmark class (see here for more information) to measure how quickly our loading and tests run.
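As an example of the Benchmark class in action, here is the sort of quick comparison we can run to see where the time goes (the file name below is just a placeholder):

    require 'benchmark'
    require 'json'

    # Time two variants of the same work side by side; Benchmark prints
    # user, system, total, and wall-clock time for each labeled block.
    Benchmark.bm(12) do |bm|
      bm.report('read+parse:') do
        1_000.times { JSON.parse(File.read('sample.json')) }
      end
      bm.report('read only:') do
        1_000.times { File.read('sample.json') }
      end
    end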

Next Week

In the coming week, we hope to start working with the data we’ve scraped: loading it into our database and running our tests against it to make sure it loaded correctly. From there, we’ll be able to look at trends, associations, and anything else in the data that helps us answer our initial research questions.
