Email: pkmvse at rit-domain

Office hours: MW 2:00–3:00 PM, plus by email

Office: Golisano 70-1521

In this project, you will implement a **Naive Bayes (NB)** classifier to predict whether a software module is likely to have defects or not.

Treat this as a competition! The best implementation (judged on multiple factors, including the f1 score, quality of implementation, and documentation) will receive 20% bonus points for the project.

You will use the **CM1 dataset**, publicly available from the PROMISE Software Engineering Repository. The dataset is available in ARFF (for use with Weka) and CSV formats.

CM1 is a NASA spacecraft instrument written in C. This dataset consists of observations about 498 software modules. Each observation consists of 21 features and a class variable, listed below.

- loc: McCabe's line count of code
- v(g) : McCabe "cyclomatic complexity"
- ev(g) : McCabe "essential complexity"
- iv(g) : McCabe "design complexity"
- n : Halstead total operators + operands
- v : Halstead "volume"
- l : Halstead "program length"
- d : Halstead "difficulty"
- i : Halstead "intelligence"
- e : Halstead "effort"
- b : Halstead
- t : Halstead's time estimator
- lOCode : Halstead's line count
- lOComment : Halstead's count of lines of comments
- lOBlank : Halstead's count of blank lines
- lOCodeAndComment
- uniq_Op : unique operators
- uniq_Opnd : unique operands
- total_Op : total operators
- total_Opnd : total operands
- branchCount : branch count of the flow graph
- defects {false,true}: whether the module has no reported defects (false) or one or more reported defects (true)

The features in the CM1 dataset are all continuous variables. In Part 1, you will discretize the features. You can employ any discretization technique, including simple techniques that divide the values of a feature into K equal-width intervals or K equal-frequency partitions. You can read more about discretization in [Tan 2006; Pages 57--63]. The Wikipedia page on discretization also has some useful references. Note that this task is open-ended. For example, the choice of parameter settings (e.g., the value of K if you use equal-width partitions) is up to you. You need to describe the technique and the parameter settings in your report.
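As a minimal sketch of the two simple techniques mentioned above, equal-width and equal-frequency partitioning could look like this in Python. The function names and the handling of edge cases (constant features, tied values) are illustrative assumptions, not a prescribed design.

```python
def equal_width_bins(values, k):
    """Map each value to a bin index in 0..k-1 using k equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature goes entirely into bin 0.
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Map each value to a bin index so that bins hold roughly equal counts."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        # Assumption: ties are split by rank, so equal values may end up
        # in neighbouring bins.
        bins[i] = min(rank * k // n, k - 1)
    return bins
```

Note that these helpers bin a single list of values. When evaluating a classifier, you would typically derive the bin boundaries from the training data only and then apply them to the test data.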

In this part, you will implement techniques to (1) **train an NB classifier** from the discrete features and the class variable; and (2) **predict the class** of a given observation from its features alone.
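The two steps can be sketched as follows. This is one possible structure, not the required one: the names `train_nb` and `predict_nb`, the dictionary-based model, and the use of Laplace (add-alpha) smoothing to avoid zero probabilities are all assumptions made for illustration. Log-probabilities are summed instead of multiplying probabilities to avoid numerical underflow with 21 features.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels, alpha=1.0):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(j, c)][v] = number of class-c rows whose feature j equals v
    value_counts = defaultdict(Counter)
    domains = [set() for _ in rows[0]]
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(j, c)][v] += 1
            domains[j].add(v)
    return {
        "class_counts": class_counts,
        "value_counts": value_counts,
        "domains": domains,
        "alpha": alpha,   # smoothing strength (an assumption of this sketch)
        "n": len(labels),
    }


def predict_nb(model, row):
    """Return the class with the highest log-posterior for one observation."""
    alpha = model["alpha"]
    best_class, best_score = None, -math.inf
    for c, n_c in model["class_counts"].items():
        score = math.log(n_c / model["n"])  # log prior
        for j, v in enumerate(row):
            count = model["value_counts"][(j, c)][v]
            k = len(model["domains"][j])
            # Smoothed estimate of P(feature j = v | class c)
            score += math.log((count + alpha) / (n_c + alpha * k))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Here `rows` is a list of tuples of bin indices (one per feature) and `labels` is the matching list of class values.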

So far, you have implemented an NB classifier. In this part, you will first implement techniques to compute the following metrics for a given NB classification model.

- precision = tp / (tp + fp);
- recall = tp / (tp + fn); and
- f1 = (2 * precision * recall) / (precision + recall), where tp, fp, tn, and fn are the numbers of true positives, false positives, true negatives, and false negatives, respectively.

You can read more about precision, recall, and f1 scores from the Wikipedia pages.
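A minimal sketch of these three formulas, assuming the defective class (`True`) is the positive class and that a metric with a zero denominator is reported as 0.0 (a convention chosen here, not one stated in the assignment):

```python
def precision_recall_f1(actual, predicted, positive=True):
    """Compute precision, recall, and f1 from parallel lists of labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Note that tn does not appear in any of the three formulas, which is one reason these metrics are commonly used when the negative class dominates.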

Next, you will implement a technique to perform 10-fold cross-validation on the dataset. Read more on the Wikipedia page on cross-validation.

**Important:** Note that you have an imbalanced dataset. You must make sure that the training data in each fold contains approximately 9/10 of the observations from each of the positive and negative classes.
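One way to satisfy this requirement is stratified splitting: partition the observations of each class separately and deal them out across the folds. The sketch below is an illustration under assumptions; the function names, the fixed random seed, and the round-robin assignment are choices made here, not requirements.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split observation indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's observations round-robin across the folds.
        for pos, i in enumerate(indices):
            folds[(pos + offset) % k].append(i)
        # Shift the start so leftovers do not always land in the first folds.
        offset += len(indices)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as test data."""
    for t, test_idx in enumerate(folds):
        train_idx = [i for f, fold in enumerate(folds) if f != t for i in fold]
        yield train_idx, test_idx
```

In each of the 10 rounds you would train on `train_idx`, predict on `test_idx`, and then aggregate the metrics across rounds (for example, by averaging them or by pooling the predictions).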

In Parts 1 and 2, you discretized the features and implemented an NB classifier from them. However, discretization is not necessary. Instead, you can treat a feature as following a continuous distribution. Then, you can employ the probability density function (PDF) of that distribution in implementing the NB classifier. You may use Matlab to fit distributions.
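As an illustration only, the sketch below assumes a Gaussian (normal) distribution for a feature within each class. The assignment does not prescribe a distribution, and a Gaussian may fit skewed software metrics poorly, so treat this as a starting point. The function names and the small variance floor are assumptions of this sketch.

```python
import math


def fit_gaussian(values):
    """Estimate the mean and variance of one feature within one class."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # Assumption: floor the variance so a constant feature does not
    # cause a division by zero in the density.
    return mean, max(var, 1e-9)


def gaussian_log_pdf(x, mean, var):
    """Log of the normal density with the given mean and variance at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
```

In the classifier, the log-density of a continuous feature would take the place of the log-probability of a discrete feature value when summing the per-class score.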

Note that this is an optional part and is not graded. However, your final implementation can be based on a mix of discrete and continuous features. I will consider such an implementation as a candidate for the best implementation (and thus, bonus points).

- You must implement the project in Java or Python.
- You can use any tool for visualizing the data and getting familiar with it.

**Important:** You are being asked to implement each part from scratch. Thus, you should not use an existing tool or SDK for the functionality you are asked to implement.

- A project report in PDF format, providing instructions on how to run your project and describing the techniques you implemented.
- The complete source code of your project.

**Important:** You have an intermediate and a final deadline for the project. You must submit the above deliverables for each deadline. For the intermediate deadline, you should describe what you have accomplished so far. Of course, you are not expected to complete the project by the intermediate deadline, but you must show some progress.