Data Science Methods in Software Engineering

SWEN 789-01 (Graduate Special Topics)

Pradeep K. Murukannaiah

Email: pkmvse at rit-domain
Office hours: MW 2:00–3:00PM plus via email
Office: Golisano 70-1521




Project 4: OSS Team Assembly

Scenario: Suppose you have data about associations between developers in an open-source software (OSS) community. Specifically, given a sample of projects from the community, suppose you know which developers work on each project in the sample (such data is easy to collect for communities such as GitHub). Now suppose you want to start a new OSS project and assemble a team of developers to carry it out. You would like to form the team from developers who have already worked with each other.

Task: Given a team size, recommend the three developer teams with the most frequent associations. You will implement two frequent itemset mining algorithms.

Dataset: You can use this synthetic dataset to recommend developer teams. Treat each line (transaction) in the dataset as a unique project and each integer (item) in a line as a developer's ID. This dataset is adapted from the much larger T10I4D100K dataset. If you are feeling adventurous, you can test your implementations with the original dataset!

Important: Your project may be tested on a completely different dataset. Thus, do not make any assumptions about the number of transactions or items. However, you can assume that the dataset file will be in the same format and that the developers' IDs will be integers (not necessarily consecutive, though).
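One way to read the data without hard-coding any sizes is sketched below. It assumes the integers on each line are separated by whitespace and that blank lines can be skipped; both are assumptions about the file format, and parse_transactions is a hypothetical helper name. Because it accepts any iterable of lines, an open file handle can be passed directly.

```python
def parse_transactions(lines):
    """Turn an iterable of text lines into a list of sets of developer IDs."""
    transactions = []
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated integers
        if fields:  # assumption: blank lines are not projects
            transactions.append(frozenset(int(field) for field in fields))
    return transactions


# In-memory stand-in for a dataset file; IDs need not be consecutive.
sample = ["3 17 42\n", "17 42\n", "\n", "5 3\n"]
transactions = parse_transactions(sample)
items = set().union(*transactions)
print("Number of transactions:", len(transactions))  # → 3
print("Number of unique items:", len(items))         # → 4
```

With a real file, the same function could be called as parse_transactions(open(filename)), ideally inside a with block.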

Input: Your implementations for both Part 1 and Part 2 should accept at least three command-line inputs: dataset filename, team size, minimum support count.

An example invocation:
dev-apriori.py T48I1K.dat 5 30
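The three inputs from the example invocation could be read with Python's standard argparse module, as in the sketch below. The argument names dataset, team_size, and min_support are chosen here for illustration; the assignment only fixes their order and meaning.

```python
import argparse


def parse_args(argv=None):
    """Parse the three required positional inputs from the command line."""
    parser = argparse.ArgumentParser(description="Recommend developer teams.")
    parser.add_argument("dataset", help="path to the dataset file")
    parser.add_argument("team_size", type=int, help="developers per team")
    parser.add_argument("min_support", type=int, help="minimum support count")
    # argv=None makes argparse read sys.argv[1:], as in a normal script run.
    return parser.parse_args(argv)


# Simulates: dev-apriori.py T48I1K.dat 5 30
args = parse_args(["T48I1K.dat", "5", "30"])
print(args.dataset, args.team_size, args.min_support)  # → T48I1K.dat 5 30
```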

Output: Your implementations for both Part 1 and Part 2 must print the following information, preferably in the same format (you may print more if you wish). An example output (dummy) for the above invocation is:

      Number of transactions: 1000
      Number of unique items: 48
      Recommendation 1: [10 14 18 23 25] (Support count: 51)
      Recommendation 2: [30 34 38 43 45] (Support count: 43)
      Recommendation 3: [1 13 18 23 25] (Support count: 32)
    
If no recommendations can be found for a given team size and minimum support count, indicate that:
No recommendations can be found!
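A small helper can keep the output close to the example format. The sketch below is one possible way to do it; format_report is a hypothetical name, and printing the two count lines even when no recommendation exists is an assumption, not something the example specifies.

```python
def format_report(num_transactions, num_items, recommendations):
    """Build the report text from (team, support_count) pairs."""
    lines = [
        f"Number of transactions: {num_transactions}",
        f"Number of unique items: {num_items}",
    ]
    if not recommendations:
        lines.append("No recommendations can be found!")
    for rank, (team, support) in enumerate(recommendations, start=1):
        members = " ".join(str(dev) for dev in sorted(team))
        lines.append(
            f"Recommendation {rank}: [{members}] (Support count: {support})"
        )
    return "\n".join(lines)


print(format_report(1000, 48, [((10, 14, 18, 23, 25), 51)]))
print(format_report(1000, 48, []))
```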

Resources: Details on the algorithms you need to implement are in the slides. You can also read this chapter from the textbook, for free.

Analysis (required): Produce a line graph comparing the execution times of Part 1 and Part 2 implementations in the following format.
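Execution times for the graph can be collected with time.perf_counter from the standard library. The sketch below is only a measurement harness under stated assumptions: time_call and dummy_miner are hypothetical names, dummy_miner stands in for one of your implementations, and taking the best of several runs is one common convention rather than a requirement of the assignment.

```python
import time


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def dummy_miner(n):
    """Placeholder workload standing in for a mining implementation."""
    return sum(i * i for i in range(n))


# Print one (input size, seconds) pair per line, ready for plotting.
for size in (1_000, 10_000, 100_000):
    print(size, f"{time_call(dummy_miner, size):.6f}")
```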

Part 3 (optional): Implement the hash tree algorithm we discussed in class for computing the support counts of itemsets for each transaction (Section 6.2.4 in this chapter).

Deliverables:

Important: The project has an intermediate and a final deadline. Submit the above deliverables for each deadline. For the intermediate deadline, describe what you have accomplished so far. You are not expected to complete the project by the intermediate deadline, but you must show some progress.