Data Science Methods in Software Engineering

SWEN 789-01 (Graduate Special Topics)

Pradeep K. Murukannaiah

Email: pkmvse at rit-domain
Office hours: MW 2:00–3:00PM plus via email
Office: Golisano 70-1521




Project 2: User Story Clustering

In this project, you will implement a K-Means clustering algorithm to cluster crowd-acquired user stories about smart home applications.

Bonus for the Best Implementation

Treat this as a competition! The best implementation (measured by the SSE value described below) will receive 20% bonus points for the project.

Dataset

You will use a smart home user stories dataset consisting of around 3000 user stories about smart home applications. These user stories are a subset of the user stories collected in this paper. I encourage you to read the paper, but all details required for this project are available on this page.

Each user story in the dataset is of the format “As a role, I want feature, so that benefit.” The second, third, and fourth columns in the dataset correspond to the role, feature, and benefit of a user story, respectively. The first column is an autogenerated ID (note that the IDs are not contiguous).

For the following tasks, you can treat the combined text in the role, feature, and benefit columns of a row as a document.
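
As a minimal sketch of building one document per row, the snippet below joins the role, feature, and benefit cells and keys the result by ID. The function name rows_to_documents and the sample rows are hypothetical, and the sketch assumes a comma-separated file with no header row; adjust it if the actual dataset file differs.

```python
import csv
import io


def rows_to_documents(csv_text):
    """Map each story ID to the joined role, feature, and benefit text."""
    documents = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        story_id, role, feature, benefit = (cell.strip() for cell in row[:4])
        documents[story_id] = " ".join([role, feature, benefit])
    return documents


# Made-up rows for illustration only; they are not taken from the dataset.
sample = (
    '17,home owner,"lights that dim at night",I sleep better\n'
    "42,parent,door alerts,my kids stay safe\n"
)
docs = rows_to_documents(sample)
```

Keying by the ID rather than by row position keeps the non-contiguous IDs available for later reporting.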

Part 1: Text Preprocessing

In Part 1, you will perform the following text preprocessing on the user stories. After each step, you will only retain a subset of the words in each user story.

  1. Tokenize.
  2. Convert each token to lowercase.
  3. Parts of speech tagging: Retain only nouns, verbs, adjectives, and adverbs.
  4. Stop word removal (standard): Remove words from the Default English stopwords list.
  5. Stop word removal (custom): Create a custom list of stop words, e.g., smart, home, smart-home, and so on, and remove words from this custom list.
  6. Lemmatize: Reduce each word to its lemma.
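
The six steps above can be chained as in the following sketch, which uses only the Python standard library. It is illustrative rather than complete: the regular-expression tokenizer and the tiny STANDARD_STOPWORDS set are stand-ins assumed here (not the Default English stopwords list), and steps 3 and 6 appear only as pass-through hooks (keep_pos, lemmatize) because they need an NLP library of your choice.

```python
import re

# Assumed tokenizer: runs of letters, with internal hyphens kept ("smart-home").
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")

# Tiny illustrative stand-in for a standard English stop word list.
STANDARD_STOPWORDS = {"a", "an", "the", "i", "my", "to", "so", "that", "as", "at", "of", "and"}

# Custom stop words, taken from the examples given in step 5.
CUSTOM_STOPWORDS = {"smart", "home", "smart-home"}


def preprocess(text, keep_pos=lambda tokens: tokens, lemmatize=lambda token: token):
    """Apply steps 1 to 6 in order; the two hooks default to doing nothing."""
    tokens = TOKEN_PATTERN.findall(text)                          # 1. tokenize
    tokens = [t.lower() for t in tokens]                          # 2. lowercase
    tokens = keep_pos(tokens)                                     # 3. POS filter (hook)
    tokens = [t for t in tokens if t not in STANDARD_STOPWORDS]   # 4. standard stop words
    tokens = [t for t in tokens if t not in CUSTOM_STOPWORDS]     # 5. custom stop words
    return [lemmatize(t) for t in tokens]                         # 6. lemmatize (hook)


example = preprocess("As a home owner, I want Smart-Home lights to dim at night")
```

With the default hooks, "lights" is left unlemmatized; plugging in a real tagger and lemmatizer through the two hook arguments completes steps 3 and 6.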

Part 2: TF-IDF Computation and Vector Space Representation

In order to perform clustering, you must first represent the set of user stories in a vector space. To do so, follow the steps below.
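
The individual steps are not reproduced in this text, so the sketch below shows only one common formulation, assumed here: the raw count of a term in a story multiplied by log(N / df), where N is the number of stories and df is the number of stories containing the term. The function name tfidf_vectors and the sample stories are hypothetical; if the course defines TF or IDF differently, follow that definition instead.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Return (vocabulary, vectors) with one TF-IDF vector per story."""
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    # Document frequency: number of stories in which each term occurs.
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocabulary}
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)  # raw term frequency within this story
        vectors.append([counts[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors


# Made-up preprocessed stories for illustration only.
stories = [["light", "dim", "night"], ["door", "lock", "night"], ["light", "door"]]
vocab, vectors = tfidf_vectors(stories)
```

Each vector has one dimension per vocabulary term, so all stories share the same vector space regardless of their length.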

Part 3: K-Means Clustering

Once you have the user stories represented in a vector space, implement the K-means clustering algorithm. Employ cosine distance (1 - cosine similarity) as the distance measure (also, see here).
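
A minimal sketch of K-means with cosine distance is shown below, under some assumptions that are not specified on this page: initial centroids are sampled from the data with a fixed seed, centroids are updated as the arithmetic mean of their members, a cluster that becomes empty keeps its previous centroid, and an all-zero vector is given distance 1.0 from everything. The names kmeans, max_iter, and seed are illustrative.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumed convention for all-zero vectors
    return 1.0 - dot / (norm_u * norm_v)


def kmeans(vectors, k, max_iter=100, seed=0):
    """Cluster vectors into k groups; return (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy 2-D points: two near the x-axis, two near the y-axis.
points = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels, centroids = kmeans(points, k=2)
```

Because the result depends on the initial centroids, running the algorithm with several seeds and keeping the best run is a common way to reduce the effect of a poor start.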

Experiment with cluster sizes (K) 2 to 10.

For each value of K, compute both the Sum of Squared Error, or SSE (Slide 19), and the Mean Silhouette Coefficient, or MSC (Slide 97).
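
The slides are not reproduced here, so the following sketch relies on the textbook definitions, which may differ in detail from the slides: SSE is taken as the sum of squared distances between each story and its cluster centroid, and the silhouette of a story is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. Cosine distance is used throughout, and a story that is alone in its cluster is assigned a silhouette of 0 (an assumed convention).

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def sse(vectors, labels, centroids):
    """Sum of squared distances from each vector to its assigned centroid."""
    return sum(cosine_distance(v, centroids[lab]) ** 2 for v, lab in zip(vectors, labels))


def mean_silhouette(vectors, labels):
    """Mean silhouette coefficient over all vectors."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            scores.append(0.0)  # assumed convention for singletons
            continue
        a = sum(cosine_distance(vectors[i], vectors[j]) for j in own) / len(own)
        b = min(
            sum(cosine_distance(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy example: two perfectly separated clusters.
points = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
labels = [0, 0, 1, 1]
centroids = [[1.0, 0.0], [0.0, 1.0]]
total_sse = sse(points, labels, centroids)
msc = mean_silhouette(points, labels)
```

Lower SSE and higher MSC both indicate tighter clusters, but SSE generally keeps decreasing as K grows, so the two measures are best read together.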

Part 4: Reporting Results

For each value of K, once the clustering algorithm converges, report an output file in the following format:

Part 5: Feature Selection (Optional)

In the previous part, you chose all the features in computing clusters. Sometimes, you may benefit from choosing only a subset of features. To do so, set a threshold for the TF-IDF value and choose only those features whose TF-IDF values are above that threshold across user stories. Doing so may eliminate some noisy features.
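
One possible reading of this rule, assumed in the sketch below, is to keep a feature when its largest TF-IDF value over all user stories exceeds the threshold. Other readings (for example, using the mean value) are equally plausible. The function name select_features, the sample vocabulary, and the threshold value are hypothetical.

```python
def select_features(vocabulary, vectors, threshold):
    """Keep the columns whose maximum TF-IDF value exceeds the threshold."""
    keep = [
        j for j in range(len(vocabulary))
        if max(row[j] for row in vectors) > threshold
    ]
    reduced_vocab = [vocabulary[j] for j in keep]
    reduced_vectors = [[row[j] for j in keep] for row in vectors]
    return reduced_vocab, reduced_vectors


# Made-up TF-IDF values for illustration only.
vocab = ["dim", "door", "light"]
vectors = [[1.1, 0.0, 0.4], [0.0, 0.4, 0.0], [0.0, 0.4, 0.4]]
kept_vocab, kept_vectors = select_features(vocab, vectors, threshold=0.5)
```

Note that an aggressive threshold can leave some stories with all-zero vectors, for which cosine similarity is undefined, so check for such rows before clustering.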

Software

Deliverables

Important: You have an intermediate and a final deadline for the project. You submit the above deliverables for each deadline. For the intermediate deadline, you should describe what you have accomplished so far. Of course, you are not expected to complete the project by the intermediate deadline, but you must show some progress.