Data Science Methods in Software Engineering

SWEN 789-01 (Graduate Special Topics)

Pradeep K. Murukannaiah

Email: pkmvse at rit-domain
Office hours: MW 2:00–3:00 PM, and by email
Office: Golisano 70-1521



Project 3: Developer Networks

In this project, you will construct a developer network for a GitHub project and identify influential developers in the network via social network analysis.

Part 1: The Dataset

We will use the Eclipse Che (Java) GitHub project as the basis for constructing the network. We assume the following.

  1. Each committer (unique email ID) is a node.
  2. An edge between two committers means that both have committed to at least one file in common.
  3. Edges are undirected and unweighted.
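The three assumptions above can be sketched in a few lines of Python. This is only an illustration: the function name `build_edges`, the input shape (a map from file path to the set of committer emails), and the example addresses are all hypothetical, not part of the assignment.

```python
from itertools import combinations


def build_edges(file_committers):
    """Return the undirected, unweighted edges implied by a file -> committers map."""
    edges = set()
    for committers in file_committers.values():
        # Sorting gives every unordered pair one canonical form (a, b) with a < b,
        # so the same pair found through several files is stored only once.
        for a, b in combinations(sorted(set(committers)), 2):
            edges.add((a, b))
    return edges


# Hypothetical example data: three committers, four files.
example = {
    "README.md": {"alice@example.org", "bob@example.org"},
    "pom.xml": {"bob@example.org", "carol@example.org"},
    "Main.java": {"alice@example.org", "bob@example.org"},
    "LICENSE": {"carol@example.org"},
}
edges = build_edges(example)
```

Here alice and bob share two files but still get a single edge, and alice and carol share no file, so no edge connects them.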

Important: To retrieve the required information, you can either use the GitHub API or clone the GitHub repository and write a custom script or program that runs Git commands.
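If you take the clone-and-script route, one possible starting point is to parse the output of `git log --name-only`. The sketch below makes several assumptions that you should check against your own needs: it uses the author email (`%ae`; use `%ce` for the committer email instead), it relies on a made-up `@@` marker to tell commit lines from file paths, and it does not follow file renames. The function names are hypothetical.

```python
import subprocess
from collections import defaultdict

MARKER = "@@"  # assumed prefix; chosen because file paths rarely start with it


def read_git_log(repo_path):
    """Run git log in a local clone and return its raw text output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", f"--format={MARKER}%ae"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_git_log(log_text):
    """Map each file path to the set of emails that committed to it."""
    files = defaultdict(set)
    email = None
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines separate the commit line from the file list
        if line.startswith(MARKER):
            email = line[len(MARKER):]  # a new commit begins here
        elif email is not None:
            files[line].add(email)  # a file touched by the current commit
    return dict(files)
```

A typical call would be `parse_git_log(read_git_log("path/to/clone"))`, where the path points at your local clone.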

Part 1 Deliverables

Part 2: Constructing the Network

Now that you know the nodes and edges of the graph, it is time to construct it. We will use Neo4j for that. You can download the community edition of Neo4j for free.

To create the database, you can use any of the following.

If you are using Neo4j, I recommend using the Java or Python version rather than pure Cypher commands, as you will need some programming for the last part (note that you can also run Cypher commands from within Java and Python).

Important: All edges in Neo4j must be directed, whereas our relationships are undirected. This is not a problem: when you create the graph, create each edge with a direction (it does not matter which), and when you query the graph, ignore the direction. Read more here.
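As a rough illustration of storing a direction but ignoring it when querying, the Python sketch below keeps two Cypher statements as strings and runs them through a session object. The label `Developer`, the relationship type `COMMITTED_WITH`, and the helper names are assumptions made for this example. A pattern written without an arrowhead, `(a)-[:TYPE]-(b)`, matches a relationship in either direction, and `MERGE` on such a pattern creates the relationship with some direction only if none exists yet.

```python
# Create both nodes if needed, then one relationship between them.
CREATE_EDGE = (
    "MERGE (a:Developer {email: $a}) "
    "MERGE (b:Developer {email: $b}) "
    "MERGE (a)-[:COMMITTED_WITH]-(b)"
)

# No arrowhead in the pattern, so the stored direction is ignored.
NEIGHBORS = (
    "MATCH (d:Developer {email: $email})-[:COMMITTED_WITH]-(n:Developer) "
    "RETURN n.email AS email"
)


def load_edges(session, edges):
    """Write each (a, b) pair to the database through the given session."""
    for a, b in sorted(edges):
        session.run(CREATE_EDGE, a=a, b=b)


def neighbors(session, email):
    """Return the emails of all developers adjacent to the given one."""
    return sorted(record["email"] for record in session.run(NEIGHBORS, email=email))
```

The `session` argument is expected to behave like a session from the official Neo4j Python driver, typically obtained from `GraphDatabase.driver(uri, auth=(user, password)).session()`; any object with a compatible `run` method works, which makes the helpers easy to try without a running database.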

Part 2 Deliverables

Part 3: Analyzing the Network

Now that you have constructed the graph, compute the following measures for each node in the graph. You can use the Neo4j Java API to compute all paths and shortest paths between nodes as needed.

Refer to the slides titled "Cheliotis+SocialNetworkAnalysis.pdf" in the Google Drive folder for the definitions of the above measures.
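Several network measures build on shortest paths between nodes. If you want to sanity-check the results you get from Neo4j, a plain breadth-first search over an in-memory copy of the graph is one option. The sketch below is not the Neo4j API; the function names and the edge-list input are assumptions for illustration, and the measure definitions themselves should come from the slides.

```python
from collections import deque


def to_adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_path_lengths(adjacency, source):
    """Return the hop count from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                # In an unweighted graph, the first visit is along a shortest path.
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist
```

Nodes that cannot be reached from `source` are simply absent from the result, so decide explicitly how your measures should treat disconnected developers.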

Part 3 Deliverables

Important: The project has an intermediate and a final deadline. Submit the above deliverables at each deadline. For the intermediate deadline, describe what you have accomplished so far. You are not expected to complete the project by the intermediate deadline, but you must show some progress.