Data Science Methods in Software Engineering

SWEN 789-01 (Graduate Special Topics)

Pradeep K. Murukannaiah

Email: pkmvse at rit-domain
Office hours: MW 2:00–3:00 PM, or by email
Office: Golisano 70-1521


Project 1: Software Defect Prediction

In this project, you will implement a Naive Bayes (NB) classifier to predict whether a software module is likely to have defects or not.

Bonus for the Best Implementation

Treat this as a competition! The best implementation (measured by the f1 score described below) will receive 20% bonus points for the project.


You will use the CM1 dataset, publicly available from the PROMISE Software Engineering Repository. The dataset is available in ARFF (for use with Weka) and CSV formats.

CM1 is a NASA spacecraft instrument written in C. This dataset consists of observations about 498 software modules. Each observation consists of 21 features and a class variable, listed below.

  1. loc: McCabe's line count of code
  2. v(g) : McCabe "cyclomatic complexity"
  3. ev(g) : McCabe "essential complexity"
  4. iv(g) : McCabe "design complexity"
  5. n : Halstead total operators + operands
  6. v : Halstead "volume"
  7. l : Halstead "program length"
  8. d : Halstead "difficulty"
  9. i : Halstead "intelligence"
  10. e : Halstead "effort"
  11. b : Halstead
  12. t : Halstead's time estimator
  13. lOCode : Halstead's line count
  14. lOComment : Halstead's count of lines of comments
  15. lOBlank : Halstead's count of blank lines
  16. lOCodeAndComment
  17. uniq_Op : unique operators
  18. uniq_Opnd : unique operands
  19. total_Op : total operators
  20. total_Opnd : total operands
  21. branchCount : of the flow graph
  22. defects {false,true}: whether the module has one or more reported defects

Part 1: Discretization

The features in the CM1 dataset are all continuous variables. In Part 1, you will discretize the features. You can employ any discretization technique, including simple techniques that divide the values of a feature into K partitions of equal interval width or equal frequency. You can read more about discretization in [Tan 2006; Pages 57--63]. The Wikipedia page on discretization also has some useful references. Note that this task is open-ended. For example, the choice of parameter settings (e.g., the value of K if you use equal-width partitions) is up to you. You need to describe the technique and the parameter settings in your report.
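As a starting point, the two simple techniques mentioned above can be sketched as follows. This is a minimal sketch, not a required design: the function names, the toy values, and the choice of K = 3 are assumptions made for illustration only.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Return the k-1 interior cut points that split [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Return k-1 interior cut points so that each bin holds roughly len(values)/k values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def to_bin(value, edges):
    """Map a continuous value to a bin index in 0..len(edges)."""
    return bisect_right(edges, value)


# Toy values for a single feature (hypothetical, not taken from CM1).
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]

width_edges = equal_width_edges(feature, 3)
freq_edges = equal_frequency_edges(feature, 3)

width_bins = [to_bin(x, width_edges) for x in feature]
freq_bins = [to_bin(x, freq_edges) for x in feature]
```

On skewed values like these, equal-width partitioning puts almost every observation into the first bin, while equal-frequency partitioning spreads them evenly. Repeated values can make equal-frequency cut points coincide, which leaves some bins empty. If you evaluate with cross validation, consider computing the cut points from the training observations only.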

Part 2: NB Classifier

In this part, you will implement techniques to (1) train an NB classifier from the discrete features and the class variable; and (2) predict the class of a given observation from its features alone.

You can read about Naive Bayes classification from [Tan 2006; Chapter 5; Pages 231--237]. Here is the Wikipedia page on NB classification.
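The two steps above (training and prediction) can be sketched as below. This is one possible structure under stated assumptions, not the expected solution: the class name DiscreteNaiveBayes, the use of Laplace smoothing (alpha), and the toy rows are all illustrative choices that the project does not prescribe.

```python
import math
from collections import Counter, defaultdict


class DiscreteNaiveBayes:
    def __init__(self, alpha=1.0):
        # Laplace smoothing strength; an assumption, not a project requirement.
        self.alpha = alpha

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        n_features = len(rows[0])
        # counts[j][c][v] = number of training rows of class c with feature j equal to v
        self.counts = [defaultdict(Counter) for _ in range(n_features)]
        # values[j] = set of distinct values seen for feature j
        self.values = [set() for _ in range(n_features)]
        for row, c in zip(rows, labels):
            for j, v in enumerate(row):
                self.counts[j][c][v] += 1
                self.values[j].add(v)
        return self

    def log_scores(self, row):
        """Return log P(c) + sum_j log P(x_j | c) for every class c."""
        total = sum(self.class_counts.values())
        scores = {}
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / total)
            for j, v in enumerate(row):
                num = self.counts[j][c][v] + self.alpha
                den = n_c + self.alpha * len(self.values[j])
                score += math.log(num / den)
            scores[c] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Toy data: each row holds two already-discretized features (hypothetical values).
rows = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]
labels = [False, False, False, True, True]

model = DiscreteNaiveBayes(alpha=1.0).fit(rows, labels)
pred_defective = model.predict((2, 2))
pred_clean = model.predict((0, 0))
```

Summing log probabilities avoids numeric underflow when multiplying 21 small conditional probabilities. Without smoothing (alpha = 0), a feature value never seen with a class gives a zero probability, and the logarithm fails, so some handling of unseen values is needed either way.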

Part 3: Metrics and Cross Validation

So far, you have implemented an NB classifier. In this part, you will first implement techniques to compute the following metrics for a given NB classification model: precision, recall, and f1 score.

You can read more about precision, recall, and f1 scores from the Wikipedia pages.
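For reference, the standard definitions are precision = TP / (TP + FP), recall = TP / (TP + FN), and f1 = 2 * precision * recall / (precision + recall), where the positive class is a defective module. A minimal sketch follows; the function names, the convention of returning 0.0 when a denominator is zero, and the toy labels are assumptions for illustration.

```python
def confusion_counts(actual, predicted, positive=True):
    """Count true positives, false positives, and false negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp, fp, fn


def precision_recall_f1(actual, predicted, positive=True):
    tp, fp, fn = confusion_counts(actual, predicted, positive)
    # Returning 0.0 for an empty denominator is a convention chosen here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (hypothetical): 3 defective modules, 5 defect-free modules.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]

precision, recall, f1 = precision_recall_f1(actual, predicted)
```

Here TP = 2, FP = 1, and FN = 1, so all three metrics equal 2/3.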

Next, you will implement a technique to perform a 10-fold cross validation on the dataset. Read more from the Wikipedia page on cross-validation.

Important: Note that you have an imbalanced dataset. You must make sure that the training data in each fold contains approximately 9/10 of the observations from each of the positive and negative classes.
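One way to satisfy this requirement is to split each class separately and deal its observations across the folds in turn (a stratified split). The sketch below assumes this approach; the function names, the fixed random seed, and the toy class sizes (20 positive, 180 negative) are illustrative and not taken from CM1.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Return k lists of row indices, each with about 1/k of every class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) once per held-out fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# Toy imbalanced labels (hypothetical sizes).
labels = [True] * 20 + [False] * 180
splits = list(train_test_splits(labels, k=10, seed=0))

positives_per_test_fold = [sum(labels[i] for i in test) for _, test in splits]
```

With these toy sizes, every test fold holds 2 positive and 18 negative observations, so every training set holds 18 positive and 162 negative observations. When a class size is not a multiple of 10, the fold sizes differ by at most one per class.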

Part 4 (Optional): Continuous Features

In Parts 1 and 2, you discretized the features and implemented an NB classifier from them. However, discretization is not necessary. Instead, you can treat a feature as following a continuous distribution. Then, you can employ the probability density function (PDF) of that distribution in implementing the NB classifier. You may use Matlab to fit distributions.
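As an illustration, the sketch below assumes a Gaussian (normal) distribution for a feature within one class. The Gaussian is only one candidate and is an assumption made here; the project leaves the choice of distribution to you. The function names, the variance floor min_var, and the toy values are likewise hypothetical.

```python
import math


def fit_gaussian(values):
    """Estimate the mean and the sample variance of one feature within one class."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mean, var


def gaussian_log_pdf(x, mean, var, min_var=1e-9):
    """Log density of N(mean, var) at x; min_var guards against zero variance."""
    var = max(var, min_var)
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


# Toy values of one feature for the modules of a single class (hypothetical).
values = [2.0, 4.0, 6.0]
mean, var = fit_gaussian(values)

# In the classifier, this term would take the place of log P(x_j | c)
# for a feature that is treated as continuous.
log_density_at_mean = gaussian_log_pdf(4.0, mean, var)
```

Because each class gets its own mean and variance per feature, you would call fit_gaussian once per (feature, class) pair on the training data. Skewed features may fit a Gaussian poorly, so it is worth comparing against other distributions or the discretized version.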

Note that this is an optional part and is not graded. However, your final implementation can be based on a mix of discrete and continuous features. I will consider such an implementation as a candidate for the best implementation (and thus, bonus points).



Important: You have an intermediate and a final deadline for the project. You submit the above deliverables for each deadline. For the intermediate deadline, you should describe what you have accomplished so far. Of course, you are not expected to complete the project by the intermediate deadline, but you must show some progress.