Data Science Methods in Software Engineering

SWEN 789-01 (Graduate Special Topics)

Pradeep K. Murukannaiah

Email: pkmvse at rit-domain
Office hours: MW 2:00–3:00 PM, or by email
Office: Golisano 70-1521


Semester Project

This is the main project for this class and it contributes 30% toward the course grade. In line with the theme of this course, the project should involve both data science and software engineering.

You can choose to do a literature survey or an implementation.

Identify project type and teammate: March 20
Proposal: March 27
Intermediate progress report: April 10
Final report: May 1

Literature Survey (Solo)

A literature survey helps organize existing knowledge on a specific topic. The survey should not merely be a summary of papers. Instead, the author must spend significant effort to organize the papers. The survey author should also provide his or her own perspective on the topic on top of the existing works. You should study a topic on which you can find at least 10 published papers.

I suggest that you stick to one of the following themes:

  1. Choose a specific class of data science methods and study its different applications in the SE domain. Example topics include applications of neural networks to SE and applications of natural language processing to SE (note that the second example may be too broad, and you may need to scope it down).
  2. Choose a specific SE problem and study different solutions involving data science methods for that specific problem. Example topics you can study include data science methods for requirements prioritization and data science methods for bug prediction.

Important: Ideally, you should study a topic on which there is not already a recent survey. This provides an opportunity to publish your work in a peer-reviewed conference or journal in the future.


Implementation (Solo or Group of Two)

You will implement a set of data science methods to solve a specific SE problem. You will need a dataset to start on this path. You can choose a publicly available dataset or collect the data yourself. Data collection can be nontrivial, so if you choose to collect data yourself, please discuss it with me before you start. In the resources section below, I point to many publicly available datasets and specific problems you can address with those datasets.

Important: The effort involved must be nontrivial. It is not required that you implement a novel method, although I encourage it. However, it is required that your work involves significant effort in preprocessing the data, implementing an algorithm, or building a pipeline combining multiple methods. It is not acceptable to take a publicly available dataset and simply run it through an off-the-shelf data science method.
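
As a rough illustration of what "a pipeline combining multiple methods" can look like, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy bug report titles, the labels, the stopword list, and the function names are hypothetical and do not come from any dataset or method prescribed for this course. It chains a preprocessing step, a bag-of-words representation, and a hand-written nearest-centroid classifier. A real project would need far more effort at each stage.

```python
import math
import re
from collections import Counter

# Hypothetical toy data: (bug report title, label). Not from any real dataset.
REPORTS = [
    ("App crashes with null pointer on startup", "crash"),
    ("Crash when opening settings: segmentation fault", "crash"),
    ("Button label is misaligned on the login page", "ui"),
    ("Wrong font color on the settings page", "ui"),
]

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "and", "in", "is", "on", "the", "when", "with"}


def preprocess(text):
    """Step 1: lowercase, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def vectorize(tokens):
    """Step 2: represent a document as a sparse term-count vector."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train(examples):
    """Step 3: build one centroid (summed term counts) per label."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(vectorize(preprocess(text)))
    return centroids


def predict(centroids, text):
    """Step 4: return the label whose centroid is most similar to the text."""
    vec = vectorize(preprocess(text))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


centroids = train(REPORTS)
prediction = predict(centroids, "Segmentation fault and crash on startup")
print(prediction)
```

Each function is a separate stage, so one stage (for example, the representation in `vectorize`) can be replaced without touching the others. This sketch only shows the shape of such a pipeline and is far too small to count as a project on its own.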