In this project, you will implement a K-Means clustering algorithm to cluster crowd-acquired user stories about smart home applications.
Treat this as a competition! The best implementation (measured by the SSE value described below) will receive 20% bonus points for the project.
You will use a smart home user stories dataset consisting of around 3000 user stories about smart home applications. These user stories are a subset of the user stories collected in this paper. I encourage you to read the paper, but all details required for this project are available on this page.
Each user story in the dataset is of the format “As a role, I want feature, so that benefit.” The second, third, and fourth columns in the dataset correspond to the role, feature, and benefit of a user story, respectively. The first column is an autogenerated ID (note that IDs are not contiguous).
For the following tasks, you can treat the combined text in the role, feature, and benefit columns of a row as a document.
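As a minimal sketch of building one document per row, the snippet below joins the role, feature, and benefit columns with spaces. It assumes a comma-separated file with no header row and the ID in the first column; the function name `load_documents` and the two sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io

def load_documents(handle):
    """Map each story ID to the combined role + feature + benefit text."""
    documents = {}
    for row in csv.reader(handle):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        story_id, role, feature, benefit = row[:4]
        documents[story_id] = " ".join(
            part.strip() for part in (role, feature, benefit)
        )
    return documents

# Hypothetical rows, for illustration only (not from the real dataset).
sample = io.StringIO(
    "17,home owner,my lights to turn off automatically,I save energy\n"
    "42,parent,to lock the doors remotely,my children are safe\n"
)
docs = load_documents(sample)
```

Keying the result by ID rather than by row position avoids relying on the IDs being contiguous.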
In Part 1, you will perform the following text preprocessing on the user stories. After each step, you will only retain a subset of the words in each user story.
In order to perform clustering, you must first represent the set of user stories in a vector space. To do so, follow the steps below.
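One possible vector-space representation is a TF-IDF matrix, sketched below. The weighting used here (raw term count times log(N / document frequency)) is an assumption; if the steps for this part prescribe a different variant, follow those instead. The function name `tfidf_vectors` is hypothetical, and the input is assumed to be a list of already preprocessed token lists.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, one TF-IDF vector per document)."""
    vocab = sorted({token for doc in tokenized_docs for token in doc})
    index = {token: i for i, token in enumerate(vocab)}
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))

    vectors = []
    for doc in tokenized_docs:
        vec = [0.0] * len(vocab)
        for token, count in Counter(doc).items():
            # Assumed weighting: raw count * log(N / df).
            vec[index[token]] = count * math.log(n_docs / doc_freq[token])
        vectors.append(vec)
    return vocab, vectors

# Tiny made-up example with two tokenized documents.
vocab, vectors = tfidf_vectors([["light", "off"], ["light", "door"]])
```

With this weighting, a token that appears in every document gets weight zero, since log(N / N) = 0.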
Once you have the user stories represented in a vector space, implement the K-means clustering algorithm. Employ cosine distance (1 - cosine similarity) as the distance measure (also, see here).
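The following is a minimal sketch of K-means with cosine distance, not a reference solution. Several choices are assumptions that the assignment does not specify: initial centroids are sampled at random from the data, centroids are updated as the arithmetic mean of their members, an empty cluster keeps its previous centroid, and an all-zero vector is treated as being at distance 1 from everything. The names `cosine_distance` and `kmeans` and the parameters `max_iter` and `seed` are hypothetical.

```python
import math
import random

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: a zero vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def kmeans(vectors, k, max_iter=100, seed=0):
    """Return (labels, centroids) after convergence or max_iter rounds."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by cosine distance.
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # no assignment changed, so the algorithm has converged
        labels = new_labels
        # Update step: mean of the members of each cluster.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up 2-D points forming two obvious directions.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = kmeans(points, k=2)
```

Because the result depends on the random initialization, running the algorithm with several seeds and keeping the run with the lowest SSE is a common way to get more stable results.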
Experiment with cluster sizes (K) 2 to 10.
For each value of K, compute both the Sum of Squared Errors, or SSE (Slide 19), and the Mean Silhouette Coefficient, or MSC (Slide 97).
For each value of K, once the clustering algorithm converges, report an output file in the following format:
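The two measures could be computed along the following lines. This sketch assumes the usual textbook definitions: SSE as the sum of squared cosine distances from each point to its assigned centroid, and the silhouette of a point as (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster. The definitions on the referenced slides take precedence if they differ. Function names are hypothetical, and a point in a singleton cluster is given a silhouette of 0 by convention.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; zero vectors are treated as distance 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def sse(vectors, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        cosine_distance(v, centroids[lab]) ** 2 for v, lab in zip(vectors, labels)
    )

def mean_silhouette(vectors, labels):
    """Average silhouette coefficient over all points."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention for a singleton cluster
            continue
        a = sum(cosine_distance(vectors[i], vectors[j]) for j in own) / len(own)
        b = min(
            sum(cosine_distance(vectors[i], vectors[j]) for j in members)
            / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Made-up example: two perfectly separated clusters.
pts = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
labs = [0, 0, 1, 1]
cents = [[1.0, 0.0], [0.0, 1.0]]
total_sse = sse(pts, labs, cents)
msc = mean_silhouette(pts, labs)
```

Lower SSE and higher MSC both indicate tighter clusters, but SSE generally decreases as K grows, so the two measures may favor different values of K.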
In the previous part, you used all the features in computing clusters. Sometimes, you may benefit from choosing only a subset of features. To do so, set a threshold for the TF-IDF value and choose only those features whose TF-IDF values are above that threshold across user stories. Doing so may eliminate some noisy features.
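A minimal sketch of threshold-based feature selection follows. It adopts one possible reading of "above that threshold across user stories": a feature is kept if its largest TF-IDF value over all user stories exceeds the threshold. Other readings (for example, using the mean value) are equally plausible, so treat this as an assumption. The function name `select_features` and the sample values are hypothetical, and the input is assumed to contain at least one vector.

```python
def select_features(vocab, vectors, threshold):
    """Keep columns whose maximum TF-IDF value exceeds the threshold."""
    keep = [
        j for j in range(len(vocab))
        if max(vec[j] for vec in vectors) > threshold  # assumed criterion
    ]
    reduced_vocab = [vocab[j] for j in keep]
    reduced_vectors = [[vec[j] for j in keep] for vec in vectors]
    return reduced_vocab, reduced_vectors

# Made-up vocabulary and TF-IDF values, for illustration only.
vocab = ["door", "light", "off"]
vectors = [[0.0, 0.1, 0.7], [0.7, 0.1, 0.0]]
reduced_vocab, reduced_vectors = select_features(vocab, vectors, threshold=0.5)
```

After selection, some user stories may end up as all-zero vectors, so the distance function needs to handle that case explicitly.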
Important: You have an intermediate and a final deadline for the project. You submit the above deliverables for each deadline. For the intermediate deadline, you should describe what you have accomplished so far. Of course, you are not expected to complete the project by the intermediate deadline, but you must show some progress.