Data Science Methods in Software Engineering

SWEN 789-01 (Graduate Special Topics)

Pradeep K. Murukannaiah

Email: pkmvse at rit-domain
Office hours: MW 2:00–3:00 PM, and via email
Office: Golisano 70-1521


Project 1: Software Defect Prediction

In this project, you will implement a Naive Bayes (NB) classifier to predict whether a software module is likely to have defects or not.

Bonus for the Best Implementation

Treat this as a competition! The best implementation (judged on multiple factors, including the F1 score, quality of implementation, and documentation) will receive a 20% bonus on the project.


You will use the CM1 dataset, publicly available from the PROMISE Software Engineering Repository. The dataset is available in ARFF (for use with Weka) and CSV formats.

CM1 is a NASA spacecraft instrument written in C. The dataset consists of observations about 498 software modules. Each observation consists of 21 features and a class variable, listed below.

  1. loc: McCabe's line count of code
  2. v(g) : McCabe "cyclomatic complexity"
  3. ev(g) : McCabe "essential complexity"
  4. iv(g) : McCabe "design complexity"
  5. n : Halstead total operators + operands
  6. v : Halstead "volume"
  7. l : Halstead "program length"
  8. d : Halstead "difficulty"
  9. i : Halstead "intelligence"
  10. e : Halstead "effort"
  11. b : Halstead's delivered-bugs estimate
  12. t : Halstead's time estimator
  13. lOCode : Halstead's line count
  14. lOComment : Halstead's count of lines of comments
  15. lOBlank : Halstead's count of blank lines
  16. lOCodeAndComment : count of lines of code and comments
  17. uniq_Op : unique operators
  18. uniq_Opnd : unique operands
  19. total_Op : total operators
  20. total_Opnd : total operands
  21. branchCount : branch count of the flow graph
  22. defects {false, true} : whether the module has one or more reported defects (the class variable)

Part 1: Discretization

The features in the CM1 dataset are all continuous variables. In Part 1, you will discretize the features. You can employ any discretization technique, including simple techniques that divide the values of a feature into K equal-width intervals or K equal-frequency partitions. You can read more about discretization in [Tan 2006; Pages 57--63]. The Wikipedia page on discretization also has some useful references. Note that this task is open-ended. For example, the choice of parameter settings (e.g., the choice of K if you use equal-width partitions) is up to you. You need to describe the technique and the parameter settings in your report.
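As one illustration (not the required approach), equal-width discretization can be sketched as follows in Python; the bin count k is an arbitrary choice here, and tuning it is part of the assignment:

```python
def discretize_equal_width(values, k=5):
    """Discretize a list of continuous values into k equal-width bins.

    Returns a list of bin indices in [0, k-1]. The choice k=5 is
    illustrative only; the assignment leaves the parameter to you.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant feature
    bins = []
    for v in values:
        idx = int((v - lo) / width)
        bins.append(min(idx, k - 1))  # clamp the maximum into the last bin
    return bins
```

An equal-frequency variant would instead sort the values and cut at the k-quantiles; either satisfies Part 1 as long as you document the choice.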

Part 2: NB Classifier

In this part, you will implement techniques to (1) train an NB classifier from the discrete features and the class variable; and (2) predict the class of a given observation given only its features.

You can read about Naive Bayes classification from [Tan 2006; Chapter 5; Pages 231--237]. Here is the Wikipedia page on NB classification.
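A minimal sketch of the two steps in Python (the assignment does not prescribe a language). It assumes add-one (Laplace) smoothing for unseen feature values and log-probabilities to avoid underflow; both are common choices, not requirements:

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y):
    """Train an NB model on discretized features.

    X: list of observations, each a tuple of discrete feature values.
    y: list of class labels.
    Returns class priors, per-(feature, class) value counts, and n.
    """
    priors = Counter(y)
    counts = defaultdict(Counter)  # (feature_index, class) -> value counts
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            counts[(j, yi)][v] += 1
    return priors, counts, len(y)

def predict_nb(model, x):
    """Predict the class of x by maximizing the log-posterior."""
    priors, counts, n = model
    best, best_score = None, float("-inf")
    for c, nc in priors.items():
        score = math.log(nc / n)  # log prior
        for j, v in enumerate(x):
            vc = counts[(j, c)]
            # add-one smoothing over values seen for this feature/class
            score += math.log((vc[v] + 1) / (nc + len(vc) + 1))
        if score > best_score:
            best, best_score = c, score
    return best
```

The smoothing denominator here is one simple option; describe whatever smoothing (if any) you use in your report.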

Part 3: Metrics and Cross Validation

So far, you have implemented an NB classifier. In this part, you will first implement techniques to compute precision, recall, and F1 score for a given NB classification model.

You can read more about precision, recall, and f1 scores from the Wikipedia pages.
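For concreteness, the three metrics follow directly from the true-positive, false-positive, and false-negative counts; a sketch in Python, treating `True` (defective) as the positive class:

```python
def prf1(y_true, y_pred, positive=True):
    """Compute precision, recall, and F1 for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

The zero-denominator guards matter on an imbalanced dataset, where a fold can contain no predicted (or no actual) positives.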

Next, you will implement a technique to perform 10-fold cross validation on the dataset. Read more from the Wikipedia page on cross-validation.

Important: Note that you have an imbalanced dataset. You must make sure that the training data in each fold consists of approximately 9/10th of the observations from each of the positive and negative classes (i.e., stratified folds).
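One way to build such stratified folds is to split the indices of each class separately and deal them round-robin into the folds; a sketch in Python (the fixed seed is only for reproducibility):

```python
import random

def stratified_folds(y, k=10, seed=0):
    """Split indices 0..len(y)-1 into k folds preserving class ratios.

    Returns a list of k index lists. Each fold then holds roughly 1/k
    of each class, so the 9 training folds hold roughly 9/10th of each.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in set(y):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal shuffled indices round-robin
    return folds
```

For each of the k rounds, train on the union of the other k-1 folds and evaluate on the held-out fold, then average the metrics.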

Part 4 (Optional): Continuous Features

In Parts 1 and 2, you discretized the features and implemented an NB classifier from them. However, discretization is not necessary. Instead, you can treat a feature as following a continuous distribution. Then, you can employ the probability density function (PDF) of that distribution in implementing the NB classifier. You may use Matlab to fit distributions.
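For example, if you assume a feature is normally distributed within each class (a common but not required choice), you would estimate the per-class mean and standard deviation during training and substitute the Gaussian PDF for the discrete likelihood during prediction. A sketch in Python:

```python
import math

def gaussian_pdf(x, mean, std):
    """Gaussian PDF: one way to model a continuous feature in NB.

    Training estimates (mean, std) of the feature per class; prediction
    multiplies this density in place of a discrete likelihood term.
    """
    std = max(std, 1e-9)  # guard against zero variance in a class
    coeff = 1.0 / (math.sqrt(2 * math.pi) * std)
    return coeff * math.exp(-((x - mean) ** 2) / (2 * std ** 2))
```

Other distributions (or a mix of discrete and continuous features, as noted below) work the same way: each feature contributes its own likelihood term to the product.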

Note that this is an optional part and is not graded. However, your final implementation can be based on a mix of discrete and continuous features. I will consider such an implementation as a candidate for the best implementation (and thus, bonus points).



Important: You have an intermediate and a final deadline for the project. You must submit the above deliverables for each deadline. For the intermediate deadline, you should describe what you have accomplished so far. Of course, you are not expected to complete the project by the intermediate deadline, but you must show some progress.