Vulnerability History

SWEN 331 Vulnerability History Assignment

The purpose of this assignment is to have you see some real vulnerabilities up close. When you see the kinds of vulnerabilities that can happen in real products, in real life, you get a sense for how difficult they can be to find and prevent.

A secondary purpose of this assignment is for you to contribute open source vulnerability history to the academic world. This assignment is also a data curation project that can produce data useful to researchers and developers alike.

Broadly, your responsibilities are to:

Round 1. Investigate Vulnerabilities

You will be given some vulnerabilities to research and write about. Here’s what you need to do:

  1. Set up a GitHub account. If you do not have a GitHub account, you will need to create one. We recommend using a permanent, professional name as this will likely go on your resume.
  2. Give copyright consent and notify us of your GitHub username. We would like you to contribute your work to a Creative Commons/MIT Licensed repository to be used in academic research. Also, please notify us via this survey what your GitHub username is so that we can trace your GitHub username to your RIT username. NOTE: your contribution to open source is voluntary. We will make similar arrangements to submit your report privately if you do not wish to contribute to this research project. Your grade will not be affected.
  3. Get your CVEs from here. Everyone will be assigned three CVEs to research. You are only required to research two of these. We give you a third in case one of these CVEs is a dead-end. Examples of a dead-end CVE would be: a vulnerability where the fix is ONLY updating a dependency (i.e. the vulnerability was not in the product proper), or a vulnerability where all of the bugs are embargoed and no public information is available. If you have two dead-end vulnerabilities, get in touch with your instructor or TA and they will assign you a new CVE.
  4. Learn about the case study projects. This semester, we are looking at 3 projects: Apache Tomcat, Apache Struts, and Apache HTTPD. Take a few minutes to look over their documentation to familiarize yourself with the project and product.
  5. Fork our research repositories. We have one repository for each of these projects: httpd-vulnerabilities, struts-vulnerabilities, and tomcat-vulnerabilities. You can read about forking on GitHub’s docs.
  6. Clone the research repositories locally using your favorite Git client.
  7. Open up your assigned CVE files in a good text editor. For example, cves/CVE-2016-1546.yml in the httpd-vulnerabilities repo. You will be editing YAML for this assignment, which is a human-friendly JSON-like format that we use for structuring our data. Here’s another helpful link about YAML. It helps if your text editor supports syntax highlighting of YAML files so you can avoid syntax errors. My personal favorites are Atom and SublimeText.
  8. Read the research notes that are currently there for the vulnerability, including the questions that need to be filled out.
  9. Set curated to true in your YAML file. Save the file.
  10. Set up your pull request using the submission instructions below.
  11. Go to the Struts/Tomcat/HTTPD source code repositories. We keep read-only copies of these projects' main repositories on Nitron. SSH into http://nitron.se.rit.edu and you can see the repositories in /pub/swen-331/. Once you cd into that directory you can run git commands there to do your investigation. You can also do your own git clone by going to GitHub and getting their mirrors. NOTE! Tomcat has multiple repositories: tomcat (for latest version 9), tomcat55, tomcat70, tomcat80, and tomcat85.
  12. Find the vulnerability fix. Generally speaking we should have these for you, but you may need to correct and fix the data. Go to the git repository that you downloaded and find the fix (git show is good for this). Make sure this fix makes sense for your vulnerability. If there are multiple fix commits, then make sure they are all in there.
  13. Find the VCC (Vulnerability-Contributing Commit). Next, we want to dig into the changes to the files that were affected by this change and attempt to find the commit(s) that introduced this vulnerability in the first place. For this, you will need to follow our example below, but it is essentially making use of git blame. Record the VCC commit hash in the data. This is the most important part of the project in terms of its academic contribution.
  14. Find the commits between the VCC and fix. Using git log, get the commits between the VCC(s) and fix(es). You do not need to record these - we will be collecting this automatically in the future based on your VCC. But, these will be the basis for the next step.
  15. Read. Begin reading the commit messages, bug reports, and code reviews between the VCC(s) and fix(es). Record any observations, such as major events or linguistic notes as you go. Do your best to get a “big picture” of how this development team works during this time, inferring anything you can about their process, expertise, constraints, etc.
  16. Record your findings. Research the following pieces and contribute them to your CVE YAML files. We have notes in the YAML about precisely what we are looking for. Also, we have a detailed example below that Prof. Meneely did himself.
  17. Submit a Pull Request. See below.

Round 1 Submission: Pull Request

Remember, though we have assigned you three CVEs, you are only responsible for researching two. We give you an extra in case of a dead-end. We cannot give extra credit for doing a third, but if you do all three we would consider it a donation of research data.

To submit, you must create a pull request from your forked repository to ours.

  1. Fork, clone, edit, push. Clone your forked repo locally and edit your YAML files. Commit your changes to the dev branch, or to your own branch if you like. Push to your forked repo. Tip: if you have never done this before with GitHub, this may be a learning experience for you. Get help from your friends, TA, or instructor.
  2. Create a single pull request against the dev branch of our research repositories. You must name your pull request after the CVEs that you are editing, for example “CVE-2011-3092 and CVE-2011-3093”. Write a brief description that will become the commit message. If you have multiple CVE files for one project, please create one pull request that edits all of them.
  3. Correct anything from feedback or build tests. We have all of our YAML files run against some integrity checkers on Travis CI to verify they have roughly the correct structure. If your pull request does not pass, then check the details of the build to see which tests failed. You might have broken YAML syntax, or the wrong format for a commit, or some other issue. You must fix build issues before the Round 1 deadline. To fix any issues, just edit your CVEs, commit, and push. The pull request will automatically update (you don’t need to create a new pull request if you want to correct something). Don’t worry about committing and pushing too many times - we will “squash” your commits into one commit on the final merge. Note: only checking your work on Travis can be tedious. If you want to run the integrity checkers locally, see the README.md at the root of this repository.
  4. Respond to feedback. You might get immediate feedback from someone else on the project, even before Round 1 is done. See details below.
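If you have never done the fork, clone, edit, push loop before, it can be rehearsed offline first. The script below is only a sketch under stated assumptions: a local bare repository stands in for your GitHub fork, and the file name CVE-0000-0000.yml and its contents are hypothetical placeholders. Only the curated key and the dev branch come from the instructions above.

```shell
#!/bin/sh
set -eu

# Throwaway identity so commits work on a machine with no git config.
export GIT_AUTHOR_NAME="Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="Student" GIT_COMMITTER_EMAIL="student@example.com"

work=$(mktemp -d)
fork="$work/fork.git"   # stands in for your fork on GitHub (assumption)
clone="$work/clone"

# Seed the stand-in fork with a dev branch holding one uncurated,
# hypothetical CVE file.
git init -q --bare "$fork"
git init -q "$work/seed"
git -C "$work/seed" symbolic-ref HEAD refs/heads/dev
mkdir "$work/seed/cves"
printf 'CVE: CVE-0000-0000\ncurated: false\n' > "$work/seed/cves/CVE-0000-0000.yml"
git -C "$work/seed" add cves
git -C "$work/seed" commit -q -m "Seed placeholder CVE file"
git -C "$work/seed" push -q "$fork" dev

# The student workflow itself: clone, edit, commit to dev, push.
git clone -q --branch dev "$fork" "$clone"
sed 's/^curated: false$/curated: true/' "$clone/cves/CVE-0000-0000.yml" > "$work/edited.yml"
mv "$work/edited.yml" "$clone/cves/CVE-0000-0000.yml"
git -C "$clone" commit -q -am "CVE-0000-0000: mark as curated"
git -C "$clone" push -q origin dev

# Check what a pull request from the fork's dev branch would now contain.
pushed=$(git -C "$fork" show dev:cves/CVE-0000-0000.yml | grep '^curated:')
echo "$pushed"

rm -rf "$work"
```

With a real fork, the only differences are the clone URL and that the pull request itself is opened on GitHub's web interface.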

Grading:

Round 2: Reviewing other Pull Requests

A high-quality comment is insightful, actionable, and constructive. We do not place a number on how many comments you are supposed to give - only on how helpful and insightful they are.

Grading:

Example Vulnerability: CVE-2011-3092

Here’s an in-depth example done by Prof. Meneely on CVE-2011-3092. You can see his final YAML file.

Understanding the Vulnerability

The first thing I did was to read the CVE entry for the vulnerability to make sure I understood what it was. I then looked up anything I didn’t know.

For example, I learned from Wikipedia that Google’s V8 engine is their JavaScript engine and it’s the same engine they’ve had since the beginning of Chrome. V8 is also used in a bunch of other products, such as Node.js and Electron.

Next, I followed the links from the CVE entry to bug 122337. I read the description and comments. I made some notes on how it was discovered (via LangFuzz), and that everyone involved had Google email addresses. I answered those questions in my YAML and made some other notes about how it was found.

I also noted that the bug had “Blink” as its “component”, but the description said “v8” as the subsystem. I began looking at the differences between the two, and decided to go with v8 as the subsystem. Blink is a massive rendering engine, so v8 would be more specific.

I also started making more notes for my final “mistakes” question report.

Correcting the fix commit record

In this situation, we did not have a fix that was linked from the CVE. We had some data here from a prior study that we collected automatically, but it turns out it was wrong. So I had to find the fix myself. Hopefully you won’t have to do this part, but it’s instructive anyway.

I cd’ed into my Chromium source tree and ran the following commands.

First I tried:

$ git log --grep="CVE-2011-3092"
		

Sometimes you get lucky and they mention the CVE in the commit message. I was not so lucky.

Next I tried searching commit messages for the bug id:

$ git log --grep="122337"
		

The commits I got mention a code review with that number, but no BUG= clause that mentions this fix.

Next I tried searching for “invalid write” in the commit log and scrolled through to look around the dates when this was fixed.

$ git log --grep="invalid write"
		

No such luck.
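When an exact phrase turns up nothing, it can help to loosen the search before giving up on commit messages. The sketch below runs against a throwaway repository with made-up commit messages (none of them come from the real history). It shows two standard git log behaviors: --grep is case-sensitive unless -i is given, and several --grep patterns are ORed together.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
git -C "$repo" init -q

# Hypothetical commit messages for illustration only.
git -C "$repo" commit -q --allow-empty -m "Add regexp result cache"
git -C "$repo" commit -q --allow-empty -m "Fix Invalid Write in capture override. BUG=000001"
git -C "$repo" commit -q --allow-empty -m "Update docs"

# Count the commits that a given set of git log filters selects.
count() { git -C "$repo" log --format=%s "$@" | wc -l | tr -d ' '; }

exact=$(count --grep="invalid write")       # case-sensitive by default
loose=$(count -i --grep="invalid write")    # -i ignores case
either=$(count --grep="BUG=000001" --grep="result cache")   # patterns are ORed

echo "exact=$exact loose=$loose either=$either"

rm -rf "$repo"
```

Here the lowercase phrase misses the commit that capitalizes "Invalid Write", which is an easy way to come up empty on a real repository too.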

It was at this point that I realized that V8 is actually a separate project that is used in all kinds of things (as I stated above). It has its own repository, which I cloned, and I began searching there. For your vulnerability, don’t go beyond the Chromium repository - this one is just an example.

I re-ran the above searches with no luck. But I did notice that the person who patched the bug, Erik Corry, was on several commits. So, I examined commits around the time that the vulnerability would have been patched (April 12, 2012).

$ git log --before=2012-04-15 --stat
		

I used the --stat here to show the files that were changed because maybe they would show me some information about what was changed.
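The same kind of search can be narrowed further by bounding the dates on both sides and filtering on the author. This is a sketch against a throwaway repository, and the authors, dates, file names, and messages in it are all invented for illustration. The flags themselves (--after, --before, --author, --stat) are standard git log options.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
git -C "$repo" init -q

# commit_as <author> <date> <file> <message>: every value passed in is made up.
commit_as() {
  echo "$4" >> "$repo/$3"
  git -C "$repo" add "$3"
  GIT_AUTHOR_NAME="$1" GIT_AUTHOR_DATE="$2" GIT_COMMITTER_DATE="$2" \
    git -C "$repo" commit -q -m "$4"
}
commit_as "Alice" "2012-03-01 10:00:00 +0000" regexp.js "Add capture getters"
commit_as "Bob"   "2012-04-13 11:00:00 +0000" regexp.js "Fix rightContext bounds"
commit_as "Bob"   "2012-05-01 09:00:00 +0000" README    "Update docs"

# Everything Bob committed, regardless of date.
by_author=$(git -C "$repo" log --format=%s --author="Bob" | wc -l | tr -d ' ')

# Bob's commits inside a two-week window around the suspected fix date.
in_window=$(git -C "$repo" log --format=%s \
  --after=2012-04-01 --before=2012-04-15 --author="Bob")

# --stat adds the touched files, which is often the quickest sanity check.
git -C "$repo" log --stat --after=2012-04-01 --before=2012-04-15 --author="Bob"

echo "by_author=$by_author in_window=$in_window"

rm -rf "$repo"
```

Note that git applies --after and --before to the commit date, so on a repository converted from another version control system it is worth leaving a few days of slack on each side.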

Sure enough, I came across this commit:

commit b32ff09a49fe4c76827e717f911e5a0066bdad4b
		Author: erik.corry@gmail.com <erik.corry@gmail.com@ce2b1a6d-e550-0410-aec6-3dcde31c8c00>
		Date:   Fri Apr 13 11:03:22 2012 +0000

		    Regexp.rightContext was still not quite right.  Fixed and
		    added more tests.
		    Review URL: https://chromiumcodereview.appspot.com/10008104

		    git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@11312 ce2b1a6d-e550-0410-aec6-3dcde31c8c00

		 src/macros.py                    |  9 +++++++++
		 src/regexp.js                    | 16 +++++++++-------
		 test/mjsunit/regexp-capture-3.js | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
		 3 files changed, 77 insertions(+), 8 deletions(-)
		

I went to the code review URL for this commit and took a look at the test case. The test cases in regexp-capture-3.js match almost exactly to the ones found by the fuzzer in the bug. We found our fix! I updated my CVE YAML file with the new fix commit hash (b32ff09a49fe4c76827e717f911e5a0066bdad4b).

I also updated my CVE YAML file with an answer to the unit testing question - it’s clear that this code was tested prior to this vulnerability, but it was not fully tested, as they had to add a new test case to fix it.

I should note that this particular fix was rather difficult to find. Hopefully you won’t have to do so much work to find your fix, or that the fix we automatically found for you is correct.

Finding the Vulnerability-Contributing Commit (VCC)

Next, we need to find our VCC. Looking at our commit, three files were impacted: src/macros.py, src/regexp.js, and test/mjsunit/regexp-capture-3.js.

Given that the third file is clearly a test case, we do not need to trace its history. The vulnerability doesn’t exist in test cases, only in production code. So we will be tracing our VCCs on the first two files.
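For fix commits that touch many files, the list of candidates can be produced mechanically. The sketch below rebuilds a stand-in commit in a throwaway repository that touches the same three paths as the fix above. It then lists the changed files with git diff-tree and drops anything under test/. The file contents are placeholders, and filtering on the test/ prefix is an assumption about this project's layout, so the result still needs a human look.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
git -C "$repo" init -q
mkdir -p "$repo/src" "$repo/test/mjsunit"

files="src/macros.py src/regexp.js test/mjsunit/regexp-capture-3.js"

# First commit: placeholder versions of the three files.
for f in $files; do echo "placeholder" > "$repo/$f"; done
git -C "$repo" add .
git -C "$repo" commit -q -m "Initial import"

# Second commit: a stand-in for the fix, touching all three files.
for f in $files; do echo "changed" >> "$repo/$f"; done
git -C "$repo" commit -q -am "Stand-in for the fix commit"

# List every file the commit changed, then keep only non-test paths.
all_files=$(git -C "$repo" diff-tree --no-commit-id --name-only -r HEAD)
candidates=$(printf '%s\n' "$all_files" | grep -v '^test/')

echo "Files to trace for VCCs:"
echo "$candidates"

rm -rf "$repo"
```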

Let’s take a look at our commit more closely. Using the command line, we can do this:

$ git show b32ff09a49fe4c76827e717f911e5a0066bdad4b
		

This gives us this output (I’m abbreviating for just the first file…)

commit b32ff09a49fe4c76827e717f911e5a0066bdad4b
		Author: erik.corry@gmail.com <erik.corry@gmail.com@ce2b1a6d-e550-0410-aec6-3dcde31c8c00>
		Date:   Fri Apr 13 11:03:22 2012 +0000

		    Regexp.rightContext was still not quite right.  Fixed and
		    added more tests.
		    Review URL: https://chromiumcodereview.appspot.com/10008104

		    git-svn-id: http://v8.googlecode.com/svn/branches/bleeding_edge@11312 ce2b1a6d-e550-0410-aec6-3dcde31c8c00

		diff --git a/src/macros.py b/src/macros.py
		index 93287ae..699b368 100644
		--- a/src/macros.py
		+++ b/src/macros.py
		@@ -204,6 +204,15 @@ macro CAPTURE(index) = (3 + (index));
		 const CAPTURE0 = 3;
		 const CAPTURE1 = 4;

		+# For the regexp capture override array.  This has the same
		+# format as the arguments to a function called from
		+# String.prototype.replace.
		+macro OVERRIDE_MATCH(override) = ((override)[0]);
		+macro OVERRIDE_POS(override) = ((override)[(override).length - 2]);
		+macro OVERRIDE_SUBJECT(override) = ((override)[(override).length - 1]);
		+# 1-based so index of 1 returns the first capture
		+macro OVERRIDE_CAPTURE(override, index) = ((override)[(index)]);
		+
		 # PropertyDescriptor return value indices - must match
		 # PropertyDescriptorIndices in runtime.cc.
		 const IS_ACCESSOR_INDEX = 0;
		

What you see here is a diff, or a code difference from before the commit to after the commit. Anywhere you see a + sign is code that was added, and - means deleted. We have no lines that were deleted in this example.

Now, looking at this code, you can see a few things going on here. First, to fix this vulnerability, we’re defining a bunch of new macros. And that’s ALL we’re doing here. At this point you need to ask yourself: was there a security mistake made here? Or was the mistake made elsewhere and the fix was required to be here? Will we be able to point to a moment in time in the history of this file where a mistake was made?

In this situation, we’re going to answer “No”. Since the code is effectively a header file, the security mistake was not here. It’s really in the logic that needed these extra checks. So we’re going to cut off our VCC search for src/macros.py and continue to src/regexp.js

That being said, header files can have vulnerabilities in them. They can have wrong constants, wrong defaults, poor configuration, and many other problems that are the vulnerability. In fact, we have seen vulnerabilities where the fix was only modifying a constant. So don’t ignore header files without looking at your unique situation.

Ok, let’s get back to src/regexp.js.

$ git show b32ff09a49fe4c76827e717f911e5a0066bdad4b
		

…abbreviating to show you what I’m looking at…

diff --git a/src/regexp.js b/src/regexp.js
		index eb617ea..7bcb612 100644
		--- a/src/regexp.js
		+++ b/src/regexp.js
		@@ -296,7 +296,7 @@ function RegExpToString() {
		 // of the last successful match.
		 function RegExpGetLastMatch() {
		   if (lastMatchInfoOverride !== null) {
		-    return lastMatchInfoOverride[0];
		+    return OVERRIDE_MATCH(lastMatchInfoOverride);
		   }
		   var regExpSubject = LAST_SUBJECT(lastMatchInfo);
		   return SubString(regExpSubject,
		@@ -334,8 +334,8 @@ function RegExpGetLeftContext() {
		     subject = LAST_SUBJECT(lastMatchInfo);
		   } else {
		     var override = lastMatchInfoOverride;
		-    start_index = override[override.length - 2];
		-    subject = override[override.length - 1];
		+    start_index = OVERRIDE_POS(override);
		+    subject = OVERRIDE_SUBJECT(override);
		   }
		   return SubString(subject, 0, start_index);
		 }
		@@ -349,9 +349,9 @@ function RegExpGetRightContext() {
		     subject = LAST_SUBJECT(lastMatchInfo);
		   } else {
		     var override = lastMatchInfoOverride;
		-    subject = override[override.length - 1];
		-    var pattern = override[override.length - 3];
		-    start_index = override[override.length - 2] + pattern.length;
		+    subject = OVERRIDE_SUBJECT(override);
		+    var match = OVERRIDE_MATCH(override);
		+    start_index = OVERRIDE_POS(override) + match.length;
		   }
		   return SubString(subject, start_index, subject.length);
		 }
		@@ -363,7 +363,9 @@ function RegExpGetRightContext() {
		 function RegExpMakeCaptureGetter(n) {
		   return function() {
		     if (lastMatchInfoOverride) {
		-      if (n < lastMatchInfoOverride.length - 2) return lastMatchInfoOverride[n];
		+      if (n < lastMatchInfoOverride.length - 2) {
		+        return OVERRIDE_CAPTURE(lastMatchInfoOverride, n);
		+      }
		       return '';
		     }
		     var index = n * 2;
		

This code looks like we had some faulty logic. In several methods, there’s a base case check at the beginning of the method that needed correction to check a boundary case. Let’s figure out what commits contributed this faulty logic. Notice how the diff has these cryptic lines:

@@ -296,7 +296,7 @@ function RegExpToString() {
		...
		@@ -334,8 +334,8 @@ function RegExpGetLeftContext() {
		...
		@@ -349,9 +349,9 @@ function RegExpGetRightContext() {
		...
		@@ -363,7 +363,9 @@ function RegExpGetRightContext() {
		...
		

These are called hunk headers. The top of each hunk tells us where the diff begins and ends. The first hunk starts at line 296 of the old file. The next few lines are unchanged and are only there to show context, so we’re really looking for line 299 in that first hunk.
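The arithmetic behind a hunk header can be spelled out in a few lines of shell. This is a minimal sketch, not a full diff parser. It assumes both ranges carry an explicit ",length" part (as all the headers above do) and that git used its default of three context lines before the first change.

```shell
#!/bin/sh
set -eu

# Print "old_start old_len new_start new_len" for a unified-diff hunk header.
# Assumes both ranges have an explicit ",length" part.
parse_hunk() {
  printf '%s\n' "$1" | sed -n \
    's/^@@ -\([0-9][0-9]*\),\([0-9][0-9]*\) +\([0-9][0-9]*\),\([0-9][0-9]*\) @@.*/\1 \2 \3 \4/p'
}

header='@@ -296,7 +296,7 @@ function RegExpToString() {'
set -- $(parse_hunk "$header")
old_start=$1 old_len=$2 new_start=$3 new_len=$4

# Last line of the old file that the hunk covers.
old_end=$((old_start + old_len - 1))

# With three leading context lines, the first changed line sits three
# lines below the start of the hunk (assumption: default context size).
first_changed=$((old_start + 3))

echo "old file: lines $old_start-$old_end"
echo "new file: starts at line $new_start, $new_len lines"
echo "first changed line in the old file: about $first_changed"
```

For the first hunk this gives old lines 296-302 with the change at about line 299, which is the line to look for in the blame output.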

Now let’s use a Git tool called blame (or if you prefer the nicer word annotate - same thing). To see what this looks like, take a look at this link. It’s the blame output on the web version of this code. Git blame will go through an entire file and show you the last commit that touched a given line. This lets you figure out how mistakes may have been originally introduced.

Now, it’s important that we look at this historically, meaning, we don’t want to look at this today but at the time of the vulnerability fix. So that means we need to include our fix commit in our command. Furthermore, we need to look at the “commit just before the fix”, otherwise we’ll just see our own fix commit in the blame. Git uses the ^ symbol to indicate “the commit before”. So our Git command looks like this:

$ git blame b32ff09a49fe4c76827e717f911e5a0066bdad4b^ -- src/regexp.js
		

We get a ton of output. Sometimes Git blame can be very slow. Like, minutes. If that’s the case, you can limit your blaming to just the lines you need. See the git blame docs for -L.
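To see why the ^ matters and what -L does, here is a sketch on a throwaway repository with a two-commit history. The file contents are invented stand-ins, not the real V8 code. It blames one line at the fix itself and then at the commit just before it, using --porcelain so that the full hash is easy to extract.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
git -C "$repo" init -q

# Stand-in VCC: introduces the line we care about (line 2).
printf 'function getLastMatch() {\n  return override[0];\n}\n' > "$repo/regexp.js"
git -C "$repo" add regexp.js
git -C "$repo" commit -q -m "Add override lookup (stand-in VCC)"
vcc=$(git -C "$repo" rev-parse HEAD)

# Stand-in fix: rewrites that same line.
printf 'function getLastMatch() {\n  return OVERRIDE_MATCH(override);\n}\n' > "$repo/regexp.js"
git -C "$repo" commit -q -am "Use macro (stand-in fix)"
fix=$(git -C "$repo" rev-parse HEAD)

# Blame only line 2 as of the given revision and print the blamed hash.
blame_line2() {
  git -C "$repo" blame --porcelain -L 2,2 "$1" -- regexp.js | head -n 1 | cut -d' ' -f1
}

at_fix=$(blame_line2 "$fix")        # blames the fix itself: not useful
before_fix=$(blame_line2 "$fix^")   # blames the commit that wrote the old line

echo "blame at fix:     $at_fix"
echo "blame before fix: $before_fix"

rm -rf "$repo"
```

On the V8 checkout, the equivalent narrowing for the first hunk would be adding -L 296,302 to the blame command above.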

Here’s an abbreviated output for me:

43d26ecc src/regexp-delay.js (christian.plesner.hansen 2008-07-03 15:10:15 +0000 293) // Getters for the static properties lastMatch, lastParen, leftContext, and
		43d26ecc src/regexp-delay.js (christian.plesner.hansen 2008-07-03 15:10:15 +0000 294) // rightContext of the RegExp constructor.  The properties are computed based
		43d26ecc src/regexp-delay.js (christian.plesner.hansen 2008-07-03 15:10:15 +0000 295) // on the captures array of the last successful match and the subject string
		43d26ecc src/regexp-delay.js (christian.plesner.hansen 2008-07-03 15:10:15 +0000 296) // of the last successful match.
		43d26ecc src/regexp-delay.js (christian.plesner.hansen 2008-07-03 15:10:15 +0000 297) function RegExpGetLastMatch() {
		0adfe842 src/regexp.js       (lrn@chromium.org         2010-04-21 08:33:04 +0000 298)   if (lastMatchInfoOverride !== null) {
		0adfe842 src/regexp.js       (lrn@chromium.org         2010-04-21 08:33:04 +0000 299)     return lastMatchInfoOverride[0];
		0adfe842 src/regexp.js       (lrn@chromium.org         2010-04-21 08:33:04 +0000 300)   }
		912c8eb0 src/regexp-delay.js (erik.corry@gmail.com     2009-03-11 14:00:55 +0000 301)   var regExpSubject = LAST_SUBJECT(lastMatchInfo);
		912c8eb0 src/regexp-delay.js (erik.corry@gmail.com     2009-03-11 14:00:55 +0000 302)   return SubString(regExpSubject,
		912c8eb0 src/regexp-delay.js (erik.corry@gmail.com     2009-03-11 14:00:55 +0000 303)                    lastMatchInfo[CAPTURE0],
		912c8eb0 src/regexp-delay.js (erik.corry@gmail.com     2009-03-11 14:00:55 +0000 304)                    lastMatchInfo[CAPTURE1]);
		9da356ee src/regexp-delay.js (ager@chromium.org        2008-10-03 07:14:31 +0000 305) }
		

The few lines that are most relevant to us are 298-300. In commit 0adfe842 (short for 0adfe842a515dd206cb0322d17c05f97244c0e72), we found that someone wrote that original if-statement and did not check the boundary case for our vulnerability. That’s our first VCC.

Well… maybe. Sometimes people fix whitespace in one commit. Correct warnings. Rename stuff. Re-order methods. There’s lots of refactoring that can initially look like a VCC. So I double-checked that this was a meaningful commit using git show 0adfe842 and found that, yes, it was introducing new logic. So, yes, I’m convinced this is our first VCC.
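Whitespace-only commits are the easiest false VCC to rule out, because git blame can be told to look through them with -w. The sketch below uses a throwaway repository with invented content: one commit adds the logic, a later one only re-indents it. Plain blame points at the re-indent, while blame -w points at the commit that wrote the logic. This does not catch renames or reordered methods, so reading git show on the candidate commit is still needed.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
git -C "$repo" init -q

# Commit that introduces the logic (two-space indent on line 2).
printf 'if (override) {\n  return override[0];\n}\n' > "$repo/regexp.js"
git -C "$repo" add regexp.js
git -C "$repo" commit -q -m "Add override lookup"
logic=$(git -C "$repo" rev-parse HEAD)

# Commit that only changes indentation (four-space indent on line 2).
printf 'if (override) {\n    return override[0];\n}\n' > "$repo/regexp.js"
git -C "$repo" commit -q -am "Reindent"
reindent=$(git -C "$repo" rev-parse HEAD)

# Blame line 2 with and without whitespace changes taken into account.
plain=$(git -C "$repo" blame --porcelain -L 2,2 HEAD -- regexp.js | head -n 1 | cut -d' ' -f1)
ignore_ws=$(git -C "$repo" blame -w --porcelain -L 2,2 HEAD -- regexp.js | head -n 1 | cut -d' ' -f1)

echo "plain blame: $plain"
echo "blame -w:    $ignore_ws"

rm -rf "$repo"
```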

I recorded 0adfe842a515dd206cb0322d17c05f97244c0e72 as a VCC. You’ll notice that this commit occurred nearly two years before the fix came. That’s pretty average for vulnerabilities. Mistakes last a long time, even in big systems like Chrome and V8.

At this point I went back to my blame output and looked up the commits for the other hunks. I found and double-checked one other VCC. So my VCCs are:

I recorded these in my YAML file.

I think it’s interesting that these two VCCs were close to each other in time, and they were committed by the same person. I’m going to record that in my final “mistakes” report.

There you go! That’s how you find VCCs! It’s a tedious process at first, but it speeds up once you get used to what you’re doing. Some researchers have famously automated this process, and it’s been in wide use in the mining software repositories academic research community. The only difference between their approach and ours is that we pruned our search based on our knowledge of unit tests, header files, and other coding practices.

Here are some VCC Guidelines:

Looking between VCC and Fix

Next, we need to do some more research on what happened between our two VCCs and fixes. We’ve determined that the mistake was in src/regexp.js, so let’s look at what happened between the 2010 dates and 2012 dates with that file.

Commit 498b074bd0db2913cf2c9458407c0d340bbcc05e was the earlier of the two VCCs, so let’s just look at that one:

$ git log --stat  498b074bd0db2913cf2c9458407c0d340bbcc05e..b32ff09a49fe4c76827e717f911e5a0066bdad4b -- src/regexp.js
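The two-dot range and the trailing path filter in that command each cut the output down in a specific way, which the sketch below demonstrates on a throwaway repository. The history, file names, and messages are invented stand-ins. The point is that A..B leaves out A itself, and that the path after -- hides commits that did not touch that file.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
git -C "$repo" init -q

# change <file> <message>: append a line and commit it (all content is made up).
change() {
  echo "$2" >> "$repo/$1"
  git -C "$repo" add "$1"
  git -C "$repo" commit -q -m "$2"
}

change regexp.js "Initial regexp support"
change regexp.js "Add override array (stand-in VCC)"
vcc=$(git -C "$repo" rev-parse HEAD)
change macros.py "Unrelated macro cleanup"
change regexp.js "Refactor capture getters"
change regexp.js "Bounds-check override (stand-in fix)"
fix=$(git -C "$repo" rev-parse HEAD)

# Commits after the VCC, up to and including the fix, that touched regexp.js.
git -C "$repo" log --format='%h %s' "$vcc..$fix" -- regexp.js
between=$(git -C "$repo" log --format=%h "$vcc..$fix" -- regexp.js | wc -l | tr -d ' ')

# Starting the range at the VCC's parent pulls the VCC itself into the list.
inclusive=$(git -C "$repo" log --format=%h "$vcc^..$fix" -- regexp.js | wc -l | tr -d ' ')

echo "between=$between inclusive=$inclusive"

rm -rf "$repo"
```

The macros.py commit never shows up in either count because of the path filter, and the VCC only appears once the range starts from its parent.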
		

Some notes:

The output gives me ~20 commits over a two-year period. So there was some work, but not tons of work when you think about how much code churn you’ve created in projects you’ve worked on. I perused these commits looking for interesting changes to investigate. Here are some interesting commits with my plain English summary of what happened, that I put directly into my YAML:

With all of this information, I wrote up my final report in my CVE-2011-3092.yml.

Useful links for Chromium

You might find these links useful:

Another Example: CVE-2013-6665

Prof. Meneely has done another example for you to reference. You can read about it in CVE-2013-6665.yml

Common YAML Questions

Does indentation matter?

YES! Indentation has semantic meaning. Imagine YAML as if a Python person was redesigning JSON to be more human-friendly.

What is with the pipe character?

This is a literal block, which means that newlines are preserved. Again, indentation matters here too.

Should I use string keys or symbol keys? (e.g. commit: or :commit:)

Keep with what was written in the file. We tend to be flexible toward both. We prefer string keys.