Data Science Methods in Software Engineering

SWEN 789-01 (Graduate Special Topics)

Pradeep K. Murukannaiah

Email: pkmvse at rit-domain
Office hours: MW 2:00–3:00PM plus via email
Office: Golisano 70-1521

Project 4: OSS Team Assembly

Scenario: Suppose you have data about associations between developers in an open-source software (OSS) community. Specifically, given a sample of projects from the community, you know which developers work on each project in the sample (such data is easy to collect for communities such as GitHub). Now suppose you want to start a new OSS project and assemble a team of developers to carry it out. You would like to form the team so that its developers have already worked with each other.

Task: Given a team size, recommend the three developer teams with the most frequent associations, that is, the three itemsets of that size with the highest support counts. You will implement two frequent itemset mining algorithms.

Part 1: Implement a brute-force algorithm for mining frequent itemsets (this part will take significant computational resources for larger team sizes; be prepared to do something else when you run the program!)
Part 2: Implement the apriori algorithm for mining frequent itemsets.
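To make the contrast between the two parts concrete, here is a minimal sketch of both strategies. It assumes transactions are already loaded as sets of integers; the function names `brute_force_frequent` and `apriori_frequent` are hypothetical, and the slides may structure the algorithms differently. The brute-force version tests every size-k subset of the item universe, while the Apriori version builds size-(n+1) candidates only from frequent size-n itemsets.

```python
from collections import Counter
from itertools import combinations


def brute_force_frequent(transactions, k, min_support):
    """Count the support of every k-subset of the item universe."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for candidate in combinations(items, k):
        members = set(candidate)
        count = sum(1 for t in transactions if members <= t)
        if count >= min_support:
            frequent[candidate] = count
    return frequent


def apriori_frequent(transactions, k, min_support):
    """Level-wise search that extends only frequent itemsets."""
    singles = Counter(item for t in transactions for item in t)
    level = {(item,): c for item, c in singles.items() if c >= min_support}
    for size in range(2, k + 1):
        prev = sorted(level)
        prev_set = set(prev)
        candidates = set()
        # Join step: merge two itemsets that agree on all but their last item.
        for i, a in enumerate(prev):
            for b in prev[i + 1:]:
                if a[:-1] != b[:-1]:
                    break  # prev is sorted, so later itemsets differ as well
                cand = a + (b[-1],)
                # Prune step: every (size-1)-subset must itself be frequent.
                if all(s in prev_set for s in combinations(cand, size - 1)):
                    candidates.add(cand)
        level = {}
        for cand in candidates:
            members = set(cand)
            count = sum(1 for t in transactions if members <= t)
            if count >= min_support:
                level[cand] = count
        if not level:
            break  # no frequent itemsets of this size, so none of size k
    return level
```

Both functions return a dictionary mapping each frequent itemset (a sorted tuple of developer IDs) to its support count, so their results can be compared directly on a small dataset before timing them on the full one.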

Dataset: You can employ this synthetic dataset for recommending the developer teams. Treat each line (transaction) in the dataset as a unique project and each integer (item) in a line as a developer's ID. This dataset is adapted from a much larger T10I4D100K dataset here. If you are feeling adventurous, you can test your implementations with the original dataset!

Important: Your project may be tested on a completely different dataset. Thus, do not make any assumptions about the number of transactions or items. However, you can assume that the dataset file will be in the same format and that the developers' IDs will be integers (not necessarily consecutive, though).
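A minimal sketch of reading such a file follows. It assumes the integers on a line are separated by whitespace and that blank lines do not represent projects; both are assumptions about the file format, and the function names are hypothetical. No assumption is made about how many transactions or items exist, or about IDs being consecutive.

```python
def parse_transactions(lines):
    """Turn an iterable of text lines into a list of frozensets of integer IDs."""
    transactions = []
    for line in lines:
        fields = line.split()
        if fields:  # assumption: a blank line is not a project
            transactions.append(frozenset(int(f) for f in fields))
    return transactions


def load_transactions(path):
    """Read a dataset file with one transaction per line."""
    with open(path, encoding="utf-8") as handle:
        return parse_transactions(handle)


def unique_items(transactions):
    """Collect every developer ID that appears in at least one transaction."""
    return set().union(*transactions)
```

For example, `parse_transactions(["1 5 9", "5 120 9"])` yields two transactions over the four distinct IDs 1, 5, 9, and 120.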

Input: Your implementations for both Part 1 and Part 2 should accept at least three command-line inputs: dataset filename, team size, minimum support count.

For the filename, use a relative path. It is best to assume that the dataset file is either in the same directory as the main script or in a subdirectory called data.
For team size, I will test with sizes 2, 3, 4, 5, and 6.
For minimum support count, I will test with counts 20, 30, 40, 50, and 100.

An example invocation:

dev-apriori.py T48I1K.dat 5 30
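One way to accept these three inputs is sketched below with the standard `argparse` module. The argument names (`dataset`, `team_size`, `min_support`) and the helper `resolve_dataset` are hypothetical choices, not requirements of the assignment. The helper looks for the file next to the script first and otherwise falls back to a `data` subdirectory.

```python
import argparse
from pathlib import Path


def build_parser():
    """Create a parser for: dataset filename, team size, minimum support count."""
    parser = argparse.ArgumentParser(
        description="Recommend developer teams via frequent itemset mining.")
    parser.add_argument("dataset", help="relative path to the dataset file")
    parser.add_argument("team_size", type=int, help="developers per team")
    parser.add_argument("min_support", type=int, help="minimum support count")
    return parser


def resolve_dataset(name, base_dir):
    """Prefer base_dir/name; otherwise assume the file is in base_dir/data."""
    direct = Path(base_dir) / name
    return direct if direct.exists() else Path(base_dir) / "data" / name


if __name__ == "__main__":
    args = build_parser().parse_args()
    path = resolve_dataset(args.dataset, Path(__file__).parent)
    print(path, args.team_size, args.min_support)
```

Declaring the last two arguments with `type=int` makes the parser reject non-numeric team sizes or support counts before any mining starts.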

Output: Your implementations for both Part 1 and Part 2 must print the following information, preferably in the same format (you may print more if you wish). An example output (dummy) for the above invocation is:

      Number of transactions: 1000
      Number of unique items: 48
      Recommendation 1: [10 14 18 23 25] (Support count: 51)
      Recommendation 2: [30 34 38 43 45] (Support count: 43)
      Recommendation 3: [1 13 18 23 25] (Support count: 32)

If no recommendations can be found for a given team size and minimum support count, indicate that:

No recommendations can be found!
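The output rules above can be sketched as a small formatting function. It assumes the mining step returns a dictionary from itemsets (sorted tuples of IDs) to support counts; the name `format_report` is hypothetical. Two choices here are assumptions rather than stated requirements: ties in support count are broken by the smaller itemset, and the two header lines are still printed when no recommendation exists.

```python
def format_report(num_transactions, num_items, frequent, top=3):
    """Build the report text for the top itemsets by support count."""
    lines = [
        f"Number of transactions: {num_transactions}",
        f"Number of unique items: {num_items}",
    ]
    if not frequent:
        lines.append("No recommendations can be found!")
        return "\n".join(lines)
    # Highest support first; ties broken by itemset order (an assumption).
    ranked = sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
    for rank, (itemset, support) in enumerate(ranked, start=1):
        members = " ".join(str(item) for item in itemset)
        lines.append(
            f"Recommendation {rank}: [{members}] (Support count: {support})")
    return "\n".join(lines)
```

If fewer than three itemsets meet the minimum support count, this sketch simply prints the ones that do.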

Resources: Details on the algorithms you need to implement are in the slides. You can also read this chapter of the textbook for free.

Analysis (required): Produce a line graph comparing the execution times of Part 1 and Part 2 implementations in the following format.

X-axis: Team size (2, 3, 4, 5, 6).
Y-axis: Execution time in milliseconds.
You can measure the execution time within the script or using an external utility.
You may run each part multiple times and report the average.
You can use any tool (e.g., Matlab, R, matplotlib, or Excel) for producing the graph.
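As one possible way to collect and plot the timings, the sketch below measures a call with `time.perf_counter`, averages over several runs, and draws the two curves with matplotlib. The names `average_ms` and `plot_times`, the default of three repeats, and the output filename are all hypothetical. matplotlib is imported inside the plotting function so the timing helper works without it.

```python
import time


def average_ms(func, *args, repeats=3):
    """Run func(*args) several times and return the mean time in milliseconds."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        total += time.perf_counter() - start
    return total / repeats * 1000.0


def plot_times(team_sizes, brute_ms, apriori_ms, out_file="times.png"):
    """Draw a line graph of execution time against team size."""
    import matplotlib.pyplot as plt  # optional dependency, only needed here

    plt.plot(team_sizes, brute_ms, marker="o", label="Part 1: brute force")
    plt.plot(team_sizes, apriori_ms, marker="o", label="Part 2: Apriori")
    plt.xticks(team_sizes)
    plt.xlabel("Team size")
    plt.ylabel("Execution time (ms)")
    plt.legend()
    plt.savefig(out_file)
```

If the brute-force times dwarf the Apriori times, a logarithmic Y-axis (`plt.yscale("log")`) may make both lines readable, provided the report says so.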

Part 3 (optional): Implement the hash tree algorithm we discussed in class for computing the support counts of itemsets in each transaction (Section 6.2.4 in this chapter).
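For orientation, here is a compact sketch of a hash tree for support counting. It is one possible reading of the technique, not necessarily the version from class or the chapter. The class names, the hash function (item modulo the number of buckets), and the default bucket count and leaf capacity are all assumptions. Candidates must be sorted tuples of equal length k.

```python
class HashTreeNode:
    def __init__(self):
        self.is_leaf = True
        self.candidates = []  # itemsets stored here while the node is a leaf
        self.children = {}    # bucket number -> child, once the node is split


class HashTree:
    def __init__(self, k, buckets=3, max_leaf_size=3):
        self.k = k
        self.buckets = buckets
        self.max_leaf_size = max_leaf_size
        self.root = HashTreeNode()

    def _child(self, node, item):
        return node.children.setdefault(item % self.buckets, HashTreeNode())

    def insert(self, itemset):
        node, depth = self.root, 0
        while not node.is_leaf:  # hash on the item at the current depth
            node = self._child(node, itemset[depth])
            depth += 1
        node.candidates.append(itemset)
        self._split_if_needed(node, depth)

    def _split_if_needed(self, node, depth):
        # A leaf at depth k has no item left to hash on, so it may overflow.
        if len(node.candidates) <= self.max_leaf_size or depth >= self.k:
            return
        node.is_leaf = False
        pending, node.candidates = node.candidates, []
        for itemset in pending:
            self._child(node, itemset[depth]).candidates.append(itemset)
        for child in node.children.values():
            self._split_if_needed(child, depth + 1)

    def count_supports(self, transactions):
        """Return {candidate: support} for candidates found in any transaction."""
        counts = {}
        for t in transactions:
            matched = set()  # avoids counting a candidate twice per transaction
            self._visit(self.root, sorted(t), 0, frozenset(t), matched)
            for cand in matched:
                counts[cand] = counts.get(cand, 0) + 1
        return counts

    def _visit(self, node, items, start, tset, matched):
        if node.is_leaf:
            matched.update(c for c in node.candidates if tset.issuperset(c))
            return
        for i in range(start, len(items)):
            child = node.children.get(items[i] % self.buckets)
            if child is not None:
                self._visit(child, items, i + 1, tset, matched)
```

The point of the structure is that each transaction is compared only against candidates in the leaves it can reach, rather than against every candidate.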

Deliverables:

The complete source code of your project.
The output from your Part 1 and Part 2 implementations for different team sizes (2, 3, 4, 5, 6) for a minimum support count of 30.
A graph showing the execution time comparison.
A project report containing instructions for running the project and a high-level description of the implementations. You can also include the outputs and the graph as part of the report.

Important: You have an intermediate and a final deadline for the project. Submit the above deliverables at each deadline. For the intermediate deadline, describe what you have accomplished so far. You are not expected to complete the project by the intermediate deadline, but you must show some progress.