Data Science Methods in Software Engineering
SWEN 789-01 (Graduate Special Topics)
Pradeep K. Murukannaiah
Email: pkmvse at rit-domain
Office hours: MW 2:00–3:00PM plus via email
Office: Golisano 70-1521
Project 3: Developer Networks
In this project, you will construct a developer network of a GitHub project and identify influential developers in the network via social network analysis.
Part 1: The Dataset
We will work with the Eclipse CHE (Java) GitHub project as the basis for constructing the network. We assume the following.
- Each committer (unique email ID) is a node.
- An edge between two committers means that both have committed to at least one file in common.
- Edges are undirected and unweighted.
Important: To retrieve the required information, you can either use the GitHub API, or clone the GitHub repository and write a custom script or program that runs Git commands.
Part 1 Deliverables
- A text file where each line represents an edge in the graph in the format "email1:email2", indicating an edge between email1 and email2 (note that email2:email1 is not required as the graph is undirected)
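One possible way to derive the edge list from a local clone is sketched below in Python. It assumes the log text was produced with `git log --name-only --format=">>>%ae"`, where `>>>` is an arbitrary marker chosen here to tell email lines from file lines. `%ae` is the author email; `%ce` (committer email) may match the wording above more literally, and which one to use is left open. The function names are illustrative and not part of the assignment.

```python
from collections import defaultdict
from itertools import combinations

MARKER = ">>>"  # arbitrary prefix (an assumption) that tags email lines


def files_to_emails(log_text):
    """Map each file path to the set of emails that committed it."""
    touched = defaultdict(set)
    email = None
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(MARKER):
            email = line[len(MARKER):]
        elif email is not None:
            touched[line].add(email)
    return touched


def build_edges(touched):
    """Return sorted, de-duplicated undirected edges as (email1, email2)."""
    edges = set()
    for emails in touched.values():
        # Sorting first guarantees each pair is emitted in one order only.
        for a, b in combinations(sorted(emails), 2):
            edges.add((a, b))
    return sorted(edges)


def write_edges(edges, path):
    """Write one 'email1:email2' line per edge."""
    with open(path, "w", encoding="utf-8") as out:
        for a, b in edges:
            out.write(f"{a}:{b}\n")


# Possible usage inside a clone (not executed here):
# import subprocess
# log_text = subprocess.run(
#     ["git", "log", "--name-only", "--format=>>>%ae"],
#     capture_output=True, text=True, check=True,
# ).stdout
# write_edges(build_edges(files_to_emails(log_text)), "edges.txt")
```

Note that a file touched by k committers contributes on the order of k squared pairs, so very widely edited files dominate the running time of this approach.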
Part 2: Constructing the Network
Now that you know the nodes and edges for the graph, it is time to construct a graph. We will use Neo4J for that. You can download the community edition of Neo4J for free.
For creating the database, you can use any of the following.
- Neo4J Cypher language
- Neo4J with Java
- Neo4J with Python (Note: I have never used this)
- NetworkX Python package (Note: There is a major difference between using Neo4J and NetworkX. The former comes with a graph database whereas the latter works with graphs in memory. Although this does not make a difference for this project, it makes a big difference in practical applications)
If you are using Neo4J, I recommend using the Java or Python version as opposed to purely Cypher commands as you will need some programming for the last part (note that you can also use Cypher commands from within Java and Python).
Important: All edges in Neo4J must be directed. However, in our case, we have undirected relationships. This is not a problem though. When you create the graph, you can create the edges as directed (does not matter which direction), but when you query the graph, you can ignore the direction. Read more here.
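The directed-on-write, undirected-on-read idea can be sketched as follows in Python. The node label `Developer`, the relationship type `CO_COMMITTED`, and the function names are assumptions made for this sketch, not names required by the assignment. The loader takes any callable with the shape `run(query, **params)`, such as the `run` method of a session from the official `neo4j` Python driver.

```python
# Cypher text. The label Developer and the relationship type CO_COMMITTED
# are names chosen for this sketch only.
CREATE_EDGE = (
    "MERGE (a:Developer {email: $email1}) "
    "MERGE (b:Developer {email: $email2}) "
    "MERGE (a)-[:CO_COMMITTED]->(b)"
)

# No arrowhead in the pattern, so the stored direction is ignored.
NEIGHBORS = (
    "MATCH (a:Developer {email: $email})-[:CO_COMMITTED]-(b:Developer) "
    "RETURN DISTINCT b.email AS email"
)


def parse_edge_line(line):
    """Split an 'email1:email2' line into a pair of emails."""
    email1, email2 = line.strip().split(":", 1)
    return email1, email2


def load_edges(run, lines):
    """Send one CREATE_EDGE statement per non-empty line; return the count."""
    count = 0
    for line in lines:
        if line.strip():
            email1, email2 = parse_edge_line(line)
            run(CREATE_EDGE, email1=email1, email2=email2)
            count += 1
    return count


# Possible usage with the neo4j driver (not executed here; the URI and
# credentials are placeholders):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# with driver.session() as session, open("edges.txt") as f:
#     load_edges(session.run, f)
#     result = session.run(NEIGHBORS, email="someone@example.org")
```

Using MERGE rather than CREATE keeps the script safe to re-run, since nodes and relationships that already exist are matched instead of duplicated.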
Part 2 Deliverables
- If using Neo4J, the Cypher commands, the Java program, or the Python program used to create the database
- If using NetworkX, the Python program used to create the network
- A readme file containing any special instructions for running the commands or the program
Part 3: Analyzing the Network
Now that you have the graph constructed, compute the following measures for each node in the graph. You can use the Neo4J Java API for computing all paths and shortest paths between nodes as needed.
- Degree Centrality
- Betweenness Centrality
- Closeness Centrality
Refer to the slides titled "Cheliotis+SocialNetworkAnalysis.pdf" in the Google Drive folder for the definitions of the above measures.
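As a rough reference point for checking results on small graphs, the three measures can be sketched in plain Python as shown below. The normalizations are assumptions and may differ from the slides: degree is divided by n - 1, closeness is the number of reachable nodes divided by the sum of shortest-path distances to them, and betweenness is left unnormalized with each unordered pair counted once (computed with Brandes' algorithm). All function names are illustrative.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (email1, email2) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def degree_centrality(adj):
    """Degree divided by n - 1 (assumed normalization)."""
    n = len(adj)
    return {v: len(adj[v]) / (n - 1) if n > 1 else 0.0 for v in adj}


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def closeness_centrality(adj):
    """Reachable node count over total distance to those nodes."""
    result = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        total = sum(dist.values())
        result[v] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


def betweenness_centrality(adj):
    """Brandes' algorithm; unnormalized, each unordered pair counted once."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                       # nodes in non-decreasing distance
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        sigma = dict.fromkeys(adj, 0)    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)  # dependency of s on each node
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every pair was counted from both endpoints in an undirected graph.
    return {v: score[v] / 2 for v in score}
```

On the path a-b-c, for example, this gives b a betweenness of 1.0 and the endpoints 0.0. NetworkX provides functions with the same three names (`degree_centrality`, `betweenness_centrality`, `closeness_centrality`), though its default normalizations are not identical to the ones assumed here.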
Part 3 Deliverables
- An output CSV file with one line per node in the graph, in the format "email1,degree centrality,betweenness centrality,closeness centrality"
- The complete source code of your project.
- A readme file containing any special instructions for running the project
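The CSV deliverable above could be produced along these lines in Python. This is a minimal sketch assuming the three measures are already available as dictionaries keyed by email; the function names are hypothetical. No header row is written, since the stated format does not mention one.

```python
import csv


def centrality_rows(degree, betweenness, closeness):
    """Return one [email, degree, betweenness, closeness] row per node."""
    return [
        [email, degree[email], betweenness[email], closeness[email]]
        for email in sorted(degree)
    ]


def write_centrality_csv(out, degree, betweenness, closeness):
    """Write the rows to an open text file object."""
    writer = csv.writer(out)
    writer.writerows(centrality_rows(degree, betweenness, closeness))


# Possible usage (not executed here); newline="" is what the csv module
# documentation recommends when opening files for a csv writer:
# with open("centrality.csv", "w", newline="", encoding="utf-8") as f:
#     write_centrality_csv(f, degree, betweenness, closeness)
```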
Important: You have an intermediate and a final deadline for the project. You submit the above deliverables for each deadline. For the intermediate deadline, you should describe what you have accomplished so far. Of course, you are not expected to complete the project by the intermediate deadline, but you must show some progress.