Software Archeology @ RIT

[ar·che·ol·o·gy] n. the study of people by way of their artifacts
Week 11 - NLP and Ownership

10 Nov 2014

Previous Week

In the previous week our team made progress on a variety of fronts, including continued work on Natural Language Processing (NLP), code ownership, and major/minor contributor status. Read Kayla’s Post for more details.

This Week

This week our team worked on many different components of the research project, with NLP and code ownership being the major topics. On the NLP side, our subteam has rewritten code to phase out Treat and use NLTK instead. The team originally chose Treat because it is a Ruby toolkit, but they ran into issues with the connections Treat makes to NLP libraries written in other languages. Using NLTK will provide a more reliable way to do the NLP research our team is investigating. Our team is excited about the potential for this NLP research.

This week in code ownership research, one team member finished writing verification tests for the data collected by the OWNERship script mentioned in this post and will start looking into analysis of the code ownership data. Another team member is making progress on a ‘first OWNERship script’: given the GitHub commit hashes (unique commit IDs) of commits that modify OWNERS files, the script returns information on the first time each developer was added as an OWNER for a directory. Once this is finished, the information will be useful for seeing whether a developer was added as an OWNER too quickly and how that might impact bugs/vulnerabilities in that directory, since only people explicitly defined as OWNERS for a given directory are able to approve code reviews in that directory. (Note that there are a variety of more complicated rules involving OWNERS; a link to them can be found in the week 7 blog post.)
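To make the idea behind the ‘first OWNERship script’ concrete, here is a minimal Python sketch. It is not our team's actual script: the `OwnersChange` record, its fields, and the `first_ownership` function are hypothetical names made up for illustration, and the sketch assumes the OWNERS contents for each commit have already been extracted. It walks the changes in time order and records the first commit in which each developer shows up as an OWNER of each directory.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class OwnersChange(NamedTuple):
    """Hypothetical record of one commit that modified an OWNERS file."""
    commit_hash: str   # unique commit ID
    timestamp: int     # commit time, e.g. seconds since the epoch
    directory: str     # directory containing the OWNERS file
    owners: List[str]  # developers listed in OWNERS after this commit


def first_ownership(
    changes: Iterable[OwnersChange],
) -> Dict[Tuple[str, str], Tuple[str, int]]:
    """Map (directory, developer) to the (commit_hash, timestamp) of the
    earliest commit in which that developer is listed as an OWNER."""
    first_seen: Dict[Tuple[str, str], Tuple[str, int]] = {}
    # Process commits oldest first so the first match is the earliest one.
    for change in sorted(changes, key=lambda c: c.timestamp):
        for developer in change.owners:
            key = (change.directory, developer)
            if key not in first_seen:
                first_seen[key] = (change.commit_hash, change.timestamp)
    return first_seen
```

With a result like this, the first-OWNER date could be compared against a developer's first commit in the same directory to estimate how quickly they were made an OWNER. The sketch ignores the more complicated OWNERS rules mentioned above.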

Future Plans

In the coming weeks our team plans to start wrapping up the current code ownership work of writing scripts and aggregating data. With that data in hand, our team can begin incorporating code ownership information into different metrics to look for interesting results. Additionally, our team will continue work on NLP and, after more refinement, hopefully gather some interesting data in that area.
