In this project, you will implement a Naive Bayes (NB) classifier to predict whether a software module is likely to have defects or not.
Treat this as a competition! The best implementation (judged on multiple factors, including the F1 score, quality of implementation, and documentation) will receive 20% bonus points for the project.
You will use the CM1 dataset, publicly available from the PROMISE Software Engineering Repository. The dataset is available in ARFF (for use with Weka) and CSV formats.
CM1 is a NASA spacecraft instrument written in C. This dataset consists of observations about 498 software modules. Each observation consists of 21 features and a class variable listed below.
The features in the CM1 dataset are all continuous variables. In Part 1, you will discretize the features. You can employ any discretization technique, including simple techniques that divide the values of a feature into K equal-width intervals or K equal-frequency partitions. You can read more about discretization in [Tan 2006; Pages 57--63]. The Wikipedia page on discretization also has some useful references. Note that this task is open-ended. For example, the choice of parameter settings (e.g., the choice of K if you use equal-width partitions) is up to you. You need to describe the technique and the parameter settings in your report.
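As an illustration of the two simple techniques mentioned above, here is a minimal sketch in Python. The function names (`equal_width_edges`, `equal_frequency_edges`, `to_bin`) are hypothetical and not part of the assignment; the sketch assumes one feature's values are held in a plain list and that each partition is represented by its K-1 interior cut points.

```python
from bisect import bisect_right

def equal_width_edges(values, k):
    """Return the k-1 interior cut points splitting [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_edges(values, k):
    """Return k-1 interior cut points so each bin holds roughly len(values)/k values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def to_bin(value, edges):
    """Map a value to a bin index in 0..len(edges) using the interior cut points."""
    return bisect_right(edges, value)

# Toy feature values (made up for illustration, not taken from CM1).
values = [1, 2, 3, 4, 10, 20, 30, 100]
width_edges = equal_width_edges(values, 4)
freq_edges = equal_frequency_edges(values, 4)
width_bins = [to_bin(v, width_edges) for v in values]
freq_bins = [to_bin(v, freq_edges) for v in values]
```

On skewed values like these, equal-width bins tend to put most observations in the first bin, while equal-frequency bins spread them evenly. That difference is one thing you may want to discuss when justifying your choice of technique and K.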
In this part, you will implement techniques to (1) train an NB classifier from the discrete features and the class variable; and (2) predict the class of a given observation from its features alone.
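The two steps above can be sketched as follows. This is only one possible design, not the required one: the names `train_nb` and `predict_nb` are hypothetical, every feature is assumed to use the same number of bins, and Laplace (add-alpha) smoothing is an added assumption so that a bin never seen with a class does not force a zero probability. Scores are summed in log space to avoid underflow when multiplying 21 small probabilities.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels, n_bins, alpha=1.0):
    """Count class frequencies and per-class, per-feature bin frequencies.

    rows   : list of lists of bin indices (one inner list per module)
    labels : list of class labels, same length as rows
    n_bins : number of bins per feature (assumed equal for all features)
    alpha  : Laplace smoothing constant (an assumption of this sketch)
    """
    class_counts = Counter(labels)
    # counts[c][j][v] = number of class-c rows whose feature j falls in bin v
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            counts[c][j][v] += 1
    return {"class_counts": class_counts, "counts": counts,
            "n_bins": n_bins, "alpha": alpha, "n": len(labels)}

def predict_nb(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    best_label, best_score = None, -math.inf
    for c, n_c in model["class_counts"].items():
        score = math.log(n_c / model["n"])  # log P(class)
        for j, v in enumerate(row):
            num = model["counts"][c][j][v] + model["alpha"]
            den = n_c + model["alpha"] * model["n_bins"]
            score += math.log(num / den)    # log P(feature j = v | class)
        if score > best_score:
            best_label, best_score = c, score
    return best_label

# Toy data: two binary features; only the first one separates the classes.
rows = [[0, 0], [0, 1], [1, 1], [1, 0]]
labels = [False, False, True, True]
model = train_nb(rows, labels, n_bins=2)
```

With this toy model, `predict_nb(model, [0, 1])` returns `False` and `predict_nb(model, [1, 0])` returns `True`.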
You can read about Naive Bayes classification in [Tan 2006; Chapter 5; Pages 231--237]. Here is the Wikipedia page on NB classification.
So far, you have implemented an NB classifier. In this part, first, you will implement techniques to compute the following metrics for a given NB classification model.
You can read more about precision, recall, and F1 scores on the Wikipedia pages.
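For reference, precision, recall, and the F1 score can be computed from predicted and true labels as in the sketch below. The function name `precision_recall_f1` is hypothetical, the defective class is assumed to be encoded as `True`, and returning 0.0 when a denominator is zero is a convention chosen here rather than something the assignment prescribes.

```python
def precision_recall_f1(y_true, y_pred, positive=True):
    """Compute precision, recall, and F1 for the class given by `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels: 3 actual positives, 2 of them found, plus 1 false alarm.
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, True, False, True, False, False, False, False]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 2/3.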
Next, you will implement a technique to perform 10-fold cross-validation on the dataset. Read more on the Wikipedia page on cross-validation.
Important: Note that the dataset is imbalanced. You must make sure that the training data in each fold contains approximately 9/10 of the observations from each of the positive and negative classes.
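One way to meet this requirement is stratified splitting: shuffle the indices of each class separately and deal them round-robin into the folds. The sketch below is a minimal illustration under that assumption; the names `stratified_folds` and `train_test_splits` are hypothetical, and the fixed `seed` is only there to make runs repeatable.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Return k lists of row indices, each holding roughly 1/k of every class."""
    by_class = defaultdict(list)
    for i, c in enumerate(labels):
        by_class[c].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's indices round-robin so fold sizes differ by at most 1.
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds

def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per held-out fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

# Toy imbalanced labels (made up): 20 positives and 180 negatives.
labels = [True] * 20 + [False] * 180
splits = list(train_test_splits(labels, k=10))
```

In this toy run, every test fold holds 2 positives and 18 negatives, so each training set keeps 18 positives and 162 negatives, which is 9/10 of each class.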
In Parts 1 and 2, you discretized the features and implemented an NB classifier on top of them. However, discretization is not necessary. Instead, you can treat a feature as following a continuous distribution. Then, you can employ the probability density function (PDF) of that distribution in implementing the NB classifier. You may use Matlab to fit distributions.
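As a sketch of the idea, the snippet below assumes a Gaussian (normal) distribution for a feature within one class. The Gaussian is only one possible choice, the assignment does not prescribe it, and another distribution may fit a given feature better. The names `fit_gaussian` and `log_gaussian_pdf` are hypothetical, and the variance floor `min_var` is an added safeguard for features that are constant within a class.

```python
import math
from statistics import mean, pvariance

def fit_gaussian(values, min_var=1e-9):
    """Estimate the mean and variance of one feature within one class."""
    mu = mean(values)
    # Floor the variance so a constant feature does not cause division by zero.
    var = max(pvariance(values, mu), min_var)
    return mu, var

def log_gaussian_pdf(x, mu, var):
    """Log of the normal density N(x; mu, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

# Toy values of one feature for one class (made up for illustration).
mu, var = fit_gaussian([1.0, 2.0, 3.0])
at_mean = log_gaussian_pdf(mu, mu, var)
off_mean = log_gaussian_pdf(mu + 1.0, mu, var)
```

In an NB classifier, the log density for a continuous feature would take the place of the log of the discrete conditional probability for that feature, so discrete and continuous features can be combined in the same sum of log terms.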
Note that this is an optional part and is not graded. However, your final implementation can be based on a mix of discrete and continuous features. I will consider such an implementation as a candidate for the best implementation (and thus, bonus points).
Important: You have an intermediate and a final deadline for the project. You submit the above deliverables for each deadline. For the intermediate deadline, you should describe what you have accomplished so far. Of course, you are not expected to complete the project by the intermediate deadline, but you must show some progress.