Data Science Methods in Software Engineering

SWEN 789-01 (Graduate Special Topics)

Pradeep K. Murukannaiah

Email: pkmvse at rit-domain
Office hours: MW 2:00–3:00PM plus via email
Office: Golisano 70-1521


Project 2: User Story Clustering

In this project, you will implement a K-Means clustering algorithm to cluster crowd-acquired user stories about smart home applications.

Bonus for the Best Implementation

Treat this as a competition! The best implementation (measured by the SSE value described below) will receive a 20% bonus on the project.

Dataset

You will use a smart home user stories dataset consisting of around 3000 user stories about smart home applications. These user stories are a subset of the user stories collected in this paper. I encourage you to read the paper, but all details required for this project are available on this page.

Each user story in the dataset is of the format “As a role, I want feature, so that benefit.” The second, third, and fourth columns in the dataset correspond to the role, feature, and benefit of a user story, respectively. The first column is an autogenerated ID (note that the IDs are not contiguous).

For the following tasks, you can treat the combined text in the role, feature, and benefit columns of a row as a document.
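As a rough illustration of building one document per row, the sketch below joins the role, feature, and benefit columns and keys the result by the ID column. It assumes the dataset is a comma-separated file without a header row; the function name `rows_to_documents` and the two sample rows are hypothetical and not taken from the dataset.

```python
import csv
import io

def rows_to_documents(rows):
    """Map each user story ID to its combined role, feature, and benefit text."""
    documents = {}
    for row in rows:
        if len(row) < 4:
            continue  # skip blank or malformed rows
        story_id, role, feature, benefit = row[:4]
        documents[story_id] = " ".join(
            part.strip() for part in (role, feature, benefit)
        )
    return documents

# Hypothetical two-row sample standing in for the real dataset file.
sample = io.StringIO(
    "101,home owner,my lights to turn off automatically,I save energy\n"
    "205,parent,to lock doors remotely,my kids stay safe\n"
)
documents = rows_to_documents(csv.reader(sample))
```

If the actual file has a header row or a different delimiter, the reader setup would need to be adjusted accordingly.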

Part 1: Text Preprocessing

In Part 1, you will perform the following text preprocessing on the user stories. After each step, you will only retain a subset of the words in each user story.

Tokenize.
Convert each token to lower case.
Part-of-speech tagging: Retain only nouns, verbs, adjectives, and adverbs.
Stop word removal (standard): Remove words that appear in the default English stop word list.
Stop word removal (custom): Create a custom list of stop words, e.g., smart, home, smart-home, and so on, and remove words that appear in this custom list.
Lemmatize: Reduce each word to its lemma.
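The filtering steps in the middle of this pipeline can be sketched as follows. The sketch assumes the input is a list of (token, tag) pairs with Penn Treebank tags, such as those returned by NLTK's `nltk.pos_tag` on already lower-cased tokens; the function name `filter_tagged_tokens` and the toy stop word sets are hypothetical. Tokenization and lemmatization are left to NLTK (for example, `nltk.word_tokenize` and `WordNetLemmatizer`) and are not shown.

```python
# Penn Treebank tag prefixes: nouns (NN*), verbs (VB*), adjectives (JJ*), adverbs (RB*).
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")

def filter_tagged_tokens(tagged_tokens, standard_stopwords, custom_stopwords):
    """Keep content words that are in neither stop word list."""
    kept = []
    for token, tag in tagged_tokens:
        if not tag.startswith(CONTENT_TAG_PREFIXES):
            continue  # not a noun, verb, adjective, or adverb
        if token in standard_stopwords or token in custom_stopwords:
            continue  # standard or custom stop word
        kept.append(token)
    return kept

# Hand-tagged example; in the project the tags would come from the tagger.
tagged = [
    ("i", "PRP"), ("want", "VBP"), ("my", "PRP$"), ("smart", "JJ"),
    ("lights", "NNS"), ("to", "TO"), ("dim", "VB"), ("automatically", "RB"),
]
standard = {"i", "my", "to"}               # toy subset of a default English list
custom = {"smart", "home", "smart-home"}   # custom list suggested in the steps above
tokens = filter_tagged_tokens(tagged, standard, custom)
```

Each surviving token would then be passed to the lemmatizer as the final step.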

Part 2: TF-IDF Computation and Vector Space Representation

In order to perform clustering, you must first represent the set of user stories in a vector space. To do so, follow the steps below.

Treat each user story as a vector.
Treat each unique token (after all text preprocessing steps) in the corpus (set of all user stories) as a vector dimension.
Employ TF-IDF scores as the values for each vector dimension. A TF-IDF value is the product of term frequency and inverse document frequency. Use these definitions of term frequency and inverse document frequency.
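The sketch below shows one way to build the vectors from scratch. The exact definitions to use are the ones referenced above; as an assumption for illustration only, this sketch takes term frequency as the token count divided by the document length and inverse document frequency as the natural log of (number of documents / number of documents containing the term). The function name `compute_tfidf` is hypothetical.

```python
import math
from collections import Counter

def compute_tfidf(tokenized_docs):
    """tokenized_docs maps story ID -> list of preprocessed tokens.

    Returns (vocabulary, vectors), where vectors maps story ID -> list of
    TF-IDF values, one per vocabulary term.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vocabulary = sorted(doc_freq)
    # Assumed IDF definition: log(N / df).
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    vectors = {}
    for story_id, tokens in tokenized_docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        # Assumed TF definition: count / document length.
        vectors[story_id] = [
            (counts[term] / total) * idf[term] if total else 0.0
            for term in vocabulary
        ]
    return vocabulary, vectors

docs = {"a": ["light", "dim"], "b": ["light", "lock"]}
vocabulary, vectors = compute_tfidf(docs)
```

Under these assumed definitions, a term that occurs in every document (here, "light") gets an IDF of zero and therefore contributes nothing to any vector.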

Part 3: K-Means Clustering

Once you have the user stories represented in a vector space, implement the K-Means clustering algorithm. Employ cosine distance (1 - cosine similarity) as the distance measure (also, see here).
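A minimal sketch of the clustering loop with cosine distance is given below. Several choices in it are assumptions rather than requirements of the project: initial centroids are sampled at random from the data, a cluster's centroid is the component-wise mean of its members, an empty cluster keeps its previous centroid, a zero vector is treated as being at distance 1 from everything, and the loop stops when assignments no longer change. All function and parameter names are hypothetical.

```python
import math
import random

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: a zero vector is maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def k_means(vectors, k, max_iterations=100, seed=0):
    """Cluster vectors into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: nearest centroid by cosine distance.
        new_assignments = [
            min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged: no assignment changed
        assignments = new_assignments
        # Update step: recompute each non-empty cluster's centroid.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignments, centroids

points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignments, centroids = k_means(points, k=2)
```

Because the result depends on the initial centroids, running the algorithm with several seeds and keeping the best run is a common way to reduce the effect of a poor initialization.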

Experiment with the number of clusters (K) from 2 to 10.

For each value of K, compute both the Sum of Squared Error, or SSE (Slide 19), and the Mean Silhouette Coefficient, or MSC (Slide 97).
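The two measures could be computed along the following lines. The slide definitions are authoritative; this sketch assumes that SSE is the sum, over all user stories, of the squared cosine distance to the assigned centroid, and that the silhouette of a story is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. Scoring a singleton cluster as 0 is a further assumption, and the function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: a zero vector is maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def sum_of_squared_error(vectors, assignments, centroids):
    """Sum of squared distances from each vector to its assigned centroid."""
    return sum(
        cosine_distance(v, centroids[a]) ** 2
        for v, a in zip(vectors, assignments)
    )

def mean_silhouette(vectors, assignments):
    """Mean silhouette coefficient over all vectors (O(n^2) distance calls)."""
    clusters = {}
    for index, cluster in enumerate(assignments):
        clusters.setdefault(cluster, []).append(index)
    if len(clusters) < 2:
        return 0.0  # silhouette needs at least two clusters
    scores = []
    for i, own in enumerate(assignments):
        same = [j for j in clusters[own] if j != i]
        if not same:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        a = sum(cosine_distance(vectors[i], vectors[j]) for j in same) / len(same)
        b = min(
            sum(cosine_distance(vectors[i], vectors[j]) for j in members) / len(members)
            for cluster, members in clusters.items()
            if cluster != own
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)

points = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
labels = [0, 0, 1, 1]
sse = sum_of_squared_error(points, labels, [[1.0, 0.0], [0.0, 1.0]])
msc = mean_silhouette(points, labels)
```

For roughly 3000 user stories the pairwise loop is slow in pure Python, so caching the pairwise distances once per dataset may be worthwhile.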

Part 4: Reporting Results

For each value of K, once the clustering algorithm converges, produce an output file in the following format:

One line for each cluster
Each line in the format “Cluster N: User Story ID 1, User Story ID 2, ... ”
Last line in the file should report SSE and MSC in the format “SSE: value, MSC: value”
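A small sketch of producing this format is shown below. Numbering clusters from 1 and the function name `format_clusters` are assumptions; the returned string would be written to one file per value of K.

```python
def format_clusters(story_ids, assignments, sse, msc):
    """Build the report text: one line per cluster, then the SSE/MSC line."""
    clusters = {}
    for story_id, cluster in zip(story_ids, assignments):
        clusters.setdefault(cluster, []).append(str(story_id))
    lines = [
        f"Cluster {number}: {', '.join(clusters[cluster])}"
        for number, cluster in enumerate(sorted(clusters), start=1)
    ]
    lines.append(f"SSE: {sse}, MSC: {msc}")
    return "\n".join(lines) + "\n"

# Hypothetical IDs and scores, for illustration only.
report = format_clusters(["101", "205", "310"], [0, 1, 0], 1.5, 0.25)
```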

Part 5: Feature Selection (Optional)

In the previous part, you used all the features in computing clusters. Sometimes, you may benefit from choosing only a subset of features. To do so, set a threshold for the TF-IDF value and choose only those features whose TF-IDF values are above that threshold across user stories. Doing so may eliminate some noisy features.
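One possible reading of this step is sketched below: a feature is kept if its largest TF-IDF value over all user stories exceeds the threshold. That interpretation of "across user stories" is an assumption (a mean or a minimum would be alternative readings), and the function name `select_features` and the sample values are hypothetical.

```python
def select_features(vocabulary, vectors, threshold):
    """Keep only dimensions whose maximum TF-IDF value exceeds the threshold.

    vectors maps story ID -> list of TF-IDF values aligned with vocabulary.
    Returns (reduced_vocabulary, reduced_vectors).
    """
    keep = [
        index
        for index in range(len(vocabulary))
        if max((vec[index] for vec in vectors.values()), default=0.0) > threshold
    ]
    reduced_vocabulary = [vocabulary[index] for index in keep]
    reduced_vectors = {
        story_id: [vec[index] for index in keep]
        for story_id, vec in vectors.items()
    }
    return reduced_vocabulary, reduced_vectors

vocab = ["dim", "light", "lock"]
tfidf = {"a": [0.35, 0.0, 0.0], "b": [0.0, 0.0, 0.05]}
reduced_vocab, reduced = select_features(vocab, tfidf, threshold=0.1)
```

Note that aggressive thresholds can leave some user stories with all-zero vectors, which the distance measure then has to handle.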

Software

You must implement the project in Java or Python.
You must use Stanford CoreNLP (Java) or NLTK (Python) for text preprocessing.
You can use any tool for visualizing the data and getting familiar with it.
Important: You are being asked to implement the TF-IDF computation and the K-Means clustering algorithm from scratch. Thus, you should not use these functionalities from an existing tool or SDK.

Deliverables

A project report in PDF format, providing instructions on how to run your project and describing the techniques you implemented.
The complete source code of your project.
The output file produced for each value of K.
An SSE plot (x-axis: number of clusters (K), y-axis: SSE value).
An MSC plot (x-axis: number of clusters (K), y-axis: MSC value).

Important: You have an intermediate and a final deadline for the project. You must submit the above deliverables at each deadline. For the intermediate deadline, you should describe what you have accomplished so far. Of course, you are not expected to complete the project by the intermediate deadline, but you must show some progress.