Overview

So far in your computing career, you have probably always assumed that your input is well-formed. Indeed, we as instructors always put that assumption in… can you guess why? Because handling bad inputs is a whole subsystem unto itself. And it’s messy. That’s what we’ll be doing today.

The best way to develop and test an input handling system is by using automated unit tests. These allow us to explore all kinds of normal cases and boundary cases. And best of all, we can rerun everything and know that we’ve handled those cases properly.

Technology

There’s nothing special about our technology choice here. We could do this project in any industrial-grade programming language.

In this project, we’ll be using Javascript via Node.js. Why? Because it’s one of the most widely-used languages today. Javascript has it all: easy-to-use APIs, robust unit testing libraries, extraordinarily thorough documentation, i18n support, weird sad emoji-looking curly brace notation, and bizarre unexpected legacy idiosyncrasies that make you want to scream. The material we cover here will all be in the context of Javascript, but note that this could be done in any serious programming language.

We will assume you know nothing about Javascript specifically, and that you have worked with Java and Python. That being said, Java and Python programmers will feel right at home in JS.

Setup

These setup instructions will get you running on a basic Javascript module in Node.js. This has been tested on Mac, Windows, and Linux. Full disclosure: this was written primarily on a Windows machine.

You will need to install Node.js. We will use the latest long-term support version, which happens to be v14.17 at the time of this writing.

Install the most recent long-term support version of Node.js. Note: if you already have another version of Node.js installed, feel free to use that instead of managing multiple Node versions. Our Node.js version choice is not critical for this exercise.
Log into http://kgcoe-git.rit.edu using your RIT username and password. We are not using gitlab.com.
Create a repository on the KGCOE GitLab, called input-handling. It should be private.
Make your instructor and course assistants the Reporter role on the repository.
Clone the repository locally using your favorite Git client.

Note if you are in the SE labs, we recommend cloning to somewhere on c:\ (e.g. c:\yourusername\input-handling) not z:\. Be sure to push your code when done!

Open your repository folder in your favorite text editor.
Create the following file and directory structure. All of these files will be empty to start with, and we’ll fill them in one-by-one.

input-handling/
├── .gitignore
├── .gitlab-ci.yml
├── package.json
├── src/
│   ├── book-validator.js
│   ├── book-validator.test.js

Fill in package.json with the following contents:

{
  "devDependencies": {
    "@babel/core": "^7.14.6",
    "@babel/node": "^7.14.7",
    "@babel/preset-env": "^7.14.7",
    "babel-jest": "^27.0.6",
    "init": "^0.1.2",
    "jest": "^27.0.6"
  },
  "scripts": {
    "test": "jest"
  },
  "jest": {
    "testEnvironment": "jsdom"
  },
  "dependencies": {
    "sanitize-html": "^2.5.1"
  }
}

Run npm install. This will create a folder called node_modules and package-lock.json.
We don’t want node_modules to be checked into our repository. Add this to your .gitignore file. (Feel free to adapt this to your needs.)

/node_modules/
/coverage

When we push this to GitLab, we want to be able to run this against our continuous integration pipeline. Here’s the config for that. Copy this into the .gitlab-ci.yml file:

tests:
  image: node:latest
  stage: test
  before_script:
    - node --version
    - npm install
  script:
    - npx jest --ci

We’ll be using Jest as our unit testing framework. Let’s create a configuration file for it. Run this from the command line:

npx jest --init

You’ll get a series of questions. Here’s how you should answer them:

> npx jest --init

The following questions will help Jest to create a suitable configuration for your project

√ Would you like to use Typescript for the configuration file? ... no
√ Choose the test environment that will be used for testing » node
√ Do you want Jest to add coverage reports? ... yes
√ Which provider should be used to instrument code for coverage? » v8
√ Automatically clear mock calls and instances between every test? ... yes

📝  Configuration file created at jest.config.js

Check that you have a jest.config.js file now.
At this stage, let’s commit and push our code to the repository. When you do this, check your repository on kgcoe-git.rit.edu and make sure that the unit tests are running. Note that there is no code yet in book-validator.test.js, so you may get errors.
(Optional) Take a look to see if your favorite text editor has an extension devoted to Jest. We used Visual Studio Code to write this, and the Jest plugin provided some nice visuals of coverage and failures.
Let’s get a basic unit test up and running. We’ve adapted this from the Jest documentation. In your book-validator.js file, add this content:

// Sanity check method to make sure you have your environment up and running.
function sum(a, b){
  return a + b;
}

// To all my JS nitpickers...
// We are using CommonJS modules because that's what Jest currently best supports
// But, the more modern, preferred way is ES6 modules, i.e. "import/export"
module.exports = {
  sum,
};

Then, let’s make a unit test in our test file. In book-validator.test.js add the following content:

const v = require("./book-validator");

// Make sure that our normal test works so our environment is all set up.
test('SANITY CHECK: 1 + 2 = 3', () => {
  expect(v.sum(1, 2)).toBe(3);
});

To run your tests, run npx jest. Hopefully, your console will look something like this:

PASS  ./book-validator.test.js
✓ SANITY CHECK: 1 + 2 = 3 (5ms)

Part 0: Jest and JS Walkthrough

Let’s dissect this Jest test. Feel free to skip this part if you are already comfortable with Jest.

The basic format of a test looks like this:

test('description of your test here', () => { /*your test */} )

The test function is from Jest.

The first argument is a string of what you’ll see on your console to describe your test. It’s like the unit testing version of a comment. Make it brief and descriptive - don’t just copy-paste your code but describe it for colleagues or future-you. In this project you’ll mostly be copy-pasting our tests over.
The second argument is a function containing your test code. We’re using Javascript Arrow functions here. It requires no arguments, hence the (). If we didn’t have this arrow function here, Node would try to execute our code when first parsing our file, not when Jest needs to invoke it. In other words, the execution gets delayed until we want it to. This is a very common construction in the Javascript world.
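
To see why the callback matters, here is a minimal sketch of a toy test registry. The names miniTest and runAll are hypothetical and are not part of Jest; the real runner is far more involved, but the idea of storing a function now and calling it later is the same.

```javascript
// Toy illustration only: miniTest and runAll are made-up names, not Jest APIs.
const registered = [];
const log = [];

// Registering a test stores the function without calling it.
function miniTest(description, fn) {
  registered.push({ description, fn });
}

// Later, the "runner" decides when each stored function executes.
function runAll() {
  for (const t of registered) {
    t.fn();
  }
}

miniTest('deferred work', () => { log.push('test body ran'); });
log.push('file finished loading');

runAll();
console.log(log); // [ 'file finished loading', 'test body ran' ]
```

The test body runs only when runAll invokes it, even though it was written earlier in the file.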

Then, we use another Jest method, called expect. This convention follows a style called Behavior-Driven Development, popularized by Ruby’s rspec. It’s functionally the same as classical unit testing, but structured to be more readable. This expect structure is like:

expect(actualValue).toBe(expectedValue)

actualValue is the code-under-test you’re executing. No need to further wrap this in an arrow function.
toBe is a matcher. This is exact equality: Jest compares with Object.is, which behaves like === (not the looser ==) for nearly all values. There are plenty of other matchers in Jest, but we’ll use toBe for most of this project.
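
As a quick illustration of the kinds of equality in play (the Jest documentation describes toBe as comparing with Object.is), this sketch uses only built-in Javascript and no Jest at all:

```javascript
// Loose equality coerces types; strict equality and Object.is do not.
const loose = (1 == '1');            // true
const strict = (1 === '1');          // false
const sameValue = Object.is(1, '1'); // false

// Two arrays with equal contents are still different objects.
const sameArray = Object.is([1, 2], [1, 2]); // false

console.log(loose, strict, sameValue, sameArray);
```

For comparing the contents of arrays or objects, Jest provides the toEqual matcher instead.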

For most of our tests, we’ll also use describe to group tests together. Typically we use describe per API method you’re testing. This is great for reporting Jest results, as well as using code folding in your editor to hide tests you’re not working on. For example, we could do this to our existing test:

describe('sum', () => {
  test('SANITY CHECK: 1 + 2 = 3', () => {
    expect(v.sum(1, 2)).toBe(3);
  });
})

Finally, let’s talk about running our tests. We highly recommend using the --watchAll command line option to Jest. This keeps Jest running at all times and will re-run anything that changes. It’s super responsive and lets you iterate very quickly! Run it with npx jest --watchAll. If you have a second monitor, put this on your second monitor!

Part 1: Basic Allow and Block

Alright. Let’s handle some inputs. In this scenario, we are making a set of methods to be used in a municipal library system.

Update your book-validator.js file to have all of the module exports for our whole API:

// To all my JS nitpickers...
// We are using CommonJS modules because that's what Jest currently best supports
// But, the more modern, preferred way is ES6 modules, i.e. "import/export"
module.exports = {
  sum,
  isTitle,
  countPages,
  cleanPageNum,
  isSameTitle,
  cleanForHTML,
};

Add stubs for each of your API methods. Like this:

function isTitle(str){ return false; }
function cleanPageNum(rawStr){ return 0; }
function isSameTitle(strA, strB){ return false; }
function countPages(rawStr){ return 0; }
function cleanForHTML(dirty) { return dirty; }

Let’s start with isTitle. This is an input validation method because it returns a boolean. The system will reject the input if it’s false, and accept it if this function returns true. Add this comment above isTitle:

/*
  Valid book titles in this situation can include:
    - Cannot be any form of "Boaty McBoatface", case insensitive
    - English alphabet characters
    - Arabic numerals
    - Spaces, but no other whitespace like tabs or newlines
    - Quotes, both single and double
    - Hyphens
    - No leading or trailing whitespace
    - No newlines or tabs
*/
function isTitle(str){
  return false;
}

Let’s start with some obvious titles that should be allowed. Add these tests to your book-validator.test.js file:

/**
 * @jest-environment jsdom
 *
 * ^^^^^^^^^^^^^^^^^^^^^^^-magic comment for Jest's DOM tools. This MUST be at the top.
 * Seriously - the /** part needs to be LINE 1 of your file or jsdom will fail. NOTHING above it.
 * Have I mentioned that I hate magic comments? Such confusion.
 * Also: make sure that Line 3 is JUST that space and asterisk (i.e. an "empty line" after the @jest part)
 *
 * Jest Unit Tests for the book-validator set of functions
 */
describe('testing isTitle', () => {
  test('single letter',     () => { expect(v.isTitle('A')).toBe(true) });
  test('simple title',      () => { expect(v.isTitle('War and Peace')).toBe(true) });
});

Note: my formatting here is intended to keep the same code vertically-aligned. It makes my code look more like a test plan table.

Also note: there’s a magic comment that needs to be at the top of this file. Add that in now and it’ll make more sense later on. Don’t put anything above that comment, or Jest won’t run correctly.

Our isTitle test should now fail because our method does nothing. Make it always allow any string before moving on.
Tests pass! But now we have an insecure method and no security tests.
Let’s start with a block list. A block list is a list of specific, banned strings. (Historically, some sources called this a “blacklist”.) While these are helpful, you will find they are woefully inadequate to keep up with the creativity of the crowds. But, updating a block list in the middle of a widespread malware attack is a great way to shut down specific attacks. Add this test:

test('Block list',        () => { expect(v.isTitle("Boaty McBoatface")).toBe(false) });

To get this test to pass:

In book-validator.js, define an array constant of hardcoded strings
Include "Boaty McBoatface" in that array
In isTitle, check the input against that array and return false when it matches a blocked string. Tip: use the some() method (docs here) with an arrow function to make this a simple one-liner.
Another tip: avoid the overly-complex return true else return false code smell and just return the value of some()
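
A minimal sketch of the some() tip, using a hypothetical BLOCKED array and isBlocked helper (your names and structure may differ). It only does an exact comparison.

```javascript
// Hypothetical names for illustration; exact-match only.
const BLOCKED = ["Boaty McBoatface"];

function isBlocked(str) {
  // some() returns true as soon as one element satisfies the arrow function,
  // so its result can be returned directly with no if/else.
  return BLOCKED.some((banned) => banned === str);
}

console.log(isBlocked("Boaty McBoatface")); // true
console.log(isBlocked("War and Peace"));    // false
```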

Tests pass!
This is pretty restrictive. Let’s also check against upper and lower case letters. Add this test case, then improve your block list check to ignore case. You will have to look up how Javascript strings convert case.

test('Block, mixed case', () => { expect(v.isTitle("bOaTy McBoAtFaCe")).toBe(false) });

Tests pass!
Reality check. Ordinarily, hardcoding a block list like this is a bad idea. You generally want your block lists to be extracted to a separate system so you can update the block list without doing a full release of your software. Block lists are all about reacting to the moment. But we’ll keep it simple here and move on.
As you can see, enumerating all of the clever riffs on a single exploit is just a fool’s errand. Instead, let’s define what is allowed. That is, an allow list, or a set of rules to define what is allowed. (Historically, some sources called this a “whitelist”.) Let’s say that we want to restrict the characters in our titles to the following:

English alphabet, both cases
Arabic numerals
Spaces
Hyphens
Quotes, both single and double

Add the following tests for these:

test('English chars',     () => { expect(v.isTitle("A-z 1")).toBe(true) });
test('single quote',      () => { expect(v.isTitle("'")).toBe(true) });
test('double quote',      () => { expect(v.isTitle('"')).toBe(true) });
test('Allowed chars',     () => { expect(v.isTitle("'Ok'\"boomer\"")).toBe(true) });
test('Poop Invalid',      () => { expect(v.isTitle("💩")).toBe(false) });

How should we do this? We could certainly make an array of “allowed” characters and check each character against it in a simple loop. But what if we start doing more complex structures? Like not allowing the first letter to be a number? Or not allowing four vowels in a row? That gets complicated quickly. The best tool for the job here is regular expressions. Here’s a primer on regex.

Think of regular expressions like flexible search terms.
Regexes are a language unto themselves, and we’ll cover some of their most critical features here for input handling.
We can use regular expressions as a test, to determine if our expression fits the string, or as a capture, to extract the parts of a string that we need. We’ll start with test.
In Javascript (see docs), regular expressions can be expressed as literals when delimited by the forward slash / character, like this:

let regex = /abc/;

We use regex quantifiers to denote how we want repetitions. For example, the * denotes “zero or more” and + is “at least one”. Here are some examples:

/a+/.test('a')  // true
/a+/.test('aa') // true
/a+/.test('b')  // false
/a*/.test('')   // true
/a*/.test('b')  // true... surprised? well it has no a's right?

We use regex character classes to define what characters we are talking about. If we state simply a letter, then that’s a character class of just that letter. Otherwise, we can specify lots of letters in square brackets [] and each character is now part of that set of valid characters, for example:

/[a]+/.test('aa')    // true, functionally the same as /a+/
/[ab]+/.test('abba') // true
/[ab]+/.test('c')    // false

Character classes have ranges that allow you to include tons of sequential characters based on the character set. This allows you to be more concise:

/[a-c]+/.test('bacaa') // true
/[a-c]+/.test('d')     // false
/[A-Za-z]+/.test('d')  // all letters, regardless of case

Built-in character classes are for common situations. Here are some helpful ones:

/[\d]+/.test('123') // arabic digits
/[\w]+/.test('abc123_2') // alphabet, digits, and underscore, i.e.[A-Za-z0-9_]
/[\s]+/.test("\t \r\n\f") // whitespace, in its many, many forms
/[\-]+/.test('---') // since hyphen is special, we use \- to escape it

In regex, whitespace MATTERS. So / / means “a space character”, and /a b/ means “a space b”.

With the above primer reviewed, add to the logic of your isTitle() function the ability to check that:

the input string is NOT in the block list, AND
the input string DOES match a regular expression matching our allow list (specified above)

Tests pass! Now, let’s try to break it. Let’s add a string that has some good and some bad characters. Add this test case:

test('Anchor drop!',      () => { expect(v.isTitle("ok💩")).toBe(false) });

Did your tests fail? Mine did. Why? Because the regex test() method doesn’t test if the ENTIRE string matches your regex, it only checks if a matching string can be found somewhere in your string. So ok was in the string and it quit without looking at the rest of the string. This is such a common mistake, it has its own entry in the Common Weakness Enumeration, CWE 777.
Fortunately, it’s got an easy fix. Just add anchors! An anchor defines the beginning or end of a string. Anchors are regex assertions.

In JS, it’s ^ for the beginning, and $ for end.
Note that ^ inside of character classes means NOT, e.g.

/[^a]/.test('a') //false, i.e. "not a"
/^a/.test('a') //true, i.e. "the string must start with a"

I know I know… it’s confusing. I didn’t make the rules but we’re stuck with it. It’s also confusing to remember that ^ means “beginning” and $ means “end”. My personal mnemonics are:

“it’s the opposite of their order on my keyboard”, or,
“dollar at the beginning would be confusing for dollar amounts, so they made it at the end… and caret is just the other one”.

Either way, you don’t have to memorize anything if you have good unit tests.

Add anchors to your isTitle regex. Tests pass!
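
To make the difference concrete, here is a small comparison using a simplified character class. The pattern is an assumption for illustration and is not necessarily the allow list your isTitle needs.

```javascript
// Simplified, illustrative allow list: letters, digits, space, quotes, hyphen.
const unanchored = /[A-Za-z0-9 '"\-]+/;
const anchored = /^[A-Za-z0-9 '"\-]+$/;

// Unanchored: finds "ok" somewhere in the string and stops looking.
console.log(unanchored.test("ok💩")); // true

// Anchored: every character from start to end must be in the class.
console.log(anchored.test("ok💩"));   // false
console.log(anchored.test("ok"));     // true
```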
Reality check. Regex is one of those regional dialects that has tiny differences depending on your tech stack. For example, Ruby’s ^ and $ are for the beginning and end of a line, not the string. (Ask me how I know.) All the more reason to have lots of unit tests whenever regex is involved. A good source of explanation for all the different dialects of regex is regular-expressions.info, just make sure you’re well-caffeinated before visiting.
But what if someone decides to register a title of a book called "The Hobbit" and another person registers a book called " The Hobbit "? The second is visually indistinguishable from the first! That can lead to spoofing or phishing attempts by confusing users (CWE 451). The issue here is leading and trailing whitespace. This shouldn’t be allowed for a book title. So let’s check for that:

  test('Leading spaces',    () => { expect(v.isTitle('   a')).toBe(false) });
  test('Trailing spaces',   () => { expect(v.isTitle('a   ')).toBe(false) });

There are a variety of ways to check for this. You could use regular expressions, yes. You could also use Javascript’s trim() method and check against the original. I personally prefer the latter because it’s more readable, but if performance was critical I’d go with a regex.

Update your isTitle method to handle leading and trailing whitespace.
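
The trim() comparison mentioned above can be sketched like this; hasOuterWhitespace is a hypothetical helper name, and you may prefer to fold the check into isTitle directly.

```javascript
// Hypothetical helper for illustration.
function hasOuterWhitespace(str) {
  // trim() returns a copy without leading or trailing whitespace,
  // so any difference means the original had some.
  return str !== str.trim();
}

console.log(hasOuterWhitespace('   a')); // true
console.log(hasOuterWhitespace('a   ')); // true
console.log(hasOuterWhitespace('a b'));  // false, interior spaces are untouched
```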

Tests pass!
Speaking of confusing whitespace, what if we had a non-space whitespace? Did you know that vertical tabs exist? Add these tests and improve your regex accordingly.

  test('evil tab',          () => { expect(v.isTitle("a\tb")).toBe(false) });
  test('evil newline',      () => { expect(v.isTitle("a\nb")).toBe(false) });
  test('evil Win newline',  () => { expect(v.isTitle("a\r\nb")).toBe(false) });
  test('evil form feed',    () => { expect(v.isTitle("a\fb")).toBe(false) });
  test('evil vtab',         () => { expect(v.isTitle("a\vb")).toBe(false) });
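
One pitfall worth seeing in isolation: the \s class matches far more than the space character, so an allow list that uses \s lets all of these strings through. The two patterns below are simplified assumptions, not your full title regex.

```javascript
const withBackslashS = /^[a-z\s]+$/;   // \s: space, tab, newline, form feed, vtab, ...
const withLiteralSpace = /^[a-z ]+$/;  // only the space character itself

const samples = ["a b", "a\tb", "a\nb", "a\r\nb", "a\fb", "a\vb"];
for (const s of samples) {
  // JSON.stringify makes the invisible characters visible in the console.
  console.log(JSON.stringify(s), withBackslashS.test(s), withLiteralSpace.test(s));
}
```

With the literal space, only "a b" is accepted; with \s, every sample is.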

Null is always something that confuses systems. Sometimes the word “null” can get deserialized into an actual null object (see this entertaining article). Here, we’re not doing anything fancy with objects or C-style null terminators. But just out of superstition let’s add a few checks. Add these to your test file:

test('Null char invalid', () => { expect(v.isTitle("asdf\0")).toBe(false) });
test('null word valid',   () => { expect(v.isTitle("null")).toBe(true) });

In my case, these passed immediately. Make sure all of your tests pass before moving on to part 2.

Part 2: i18n and Unicode

Now this is going to get fun! As it turns out, and this may come as a surprise to some of you, the world has languages OTHER than English!! For example, to say that the character class \w represents “all letters” is… wrong! So how do we allow for book titles that include international letters, without allowing every character?

The topic here is i18n, or internationalization. Why i18n? Because programmers are bad spellers and got tired of misspelling it. So it’s i then 18 letters then n. People also use this for localization, as in l10n (which is annoying to those who have fonts that don’t easily distinguish l and 1 so L10n is better, but I digress).

Before we get to the tech, let’s step back and talk ethnocentrism. Every human is ethnocentric. What we experience is “normal” and what we don’t experience is “weird”. Without malicious intentions, we will always have a cognitive bias toward what is familiar to us. Technology is no different. Since most of the initial computing pioneers were native English speakers, there are i18n blindspots everywhere. And where we have blindspots, there are opportunities for exploitation. That’s why, historically, lots of vulnerabilities are related to i18n and input handling.

How do we represent all international characters? Welcome to the wonderful world of Unicode. Its dominant encoding today, UTF-8, has all kinds of clever ways of representing text.

Read this 2003 article from Trello and Stack Overflow legend Joel Spolsky. Despite its age, it holds up! (Other than treating UTF-8 as “new”; UTF-8 is now the undisputed standard.)
Here are some other key Unicode facts to know:

The translation from a number to a letter is called an encoding. This is not cryptography of any kind, it’s just a common table that the world has agreed upon.
The word for the number that translates to a character is the code point, e.g. if I want to represent the letter n literally in a string, I would do "\u006E", and the 006E is the code point.
Strings don’t usually “know” what encoding they are. There is nothing in the abstract data structure of a String that keeps this information. This is one of many reasons that strings are the most complex data structure in all of computing (change my mind!)
Many high-level programming languages use an Object to represent a string, with an array of bytes and the intended encoding as a separate field in that object. Javascript takes a different route: its strings are always sequences of UTF-16 code units.
Auto-detecting encoding is fraught with problems. Many strings can be constructed that have ambiguity between different character sets. It’s better to know and maintain that information than it is to auto-detect.
Converting from one encoding to another can also be a big challenge because of the auto-detecting problem. Definitely possible, but best left to the experts instead of rolling your own.
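
A few of these ideas can be poked at directly in Node. This sketch uses only built-in string methods and Node's Buffer; the variable names are just for illustration.

```javascript
// An escape with a code point produces the same string as the literal letter.
console.log("\u006E" === "n"); // true

// codePointAt gives the number behind a character; toString(16) shows it in hex.
console.log("n".codePointAt(0).toString(16)); // "6e"

// Javascript strings are sequences of UTF-16 code units, so a character
// outside the Basic Multilingual Plane takes two units.
const poop = "\u{1F4A9}"; // the pile-of-poo emoji
console.log(poop.length);                      // 2
console.log([...poop].length);                 // 1 (iteration goes by code point)
console.log(poop.codePointAt(0).toString(16)); // "1f4a9"

// The same text takes a different number of bytes once encoded as UTF-8.
const utf8Bytes = Buffer.from("\u00F1", "utf8").length; // ñ
console.log(utf8Bytes); // 2
```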

Let’s expand our definition of “letter” to be Unicode letters. Add this test case, and it should fail because ß isn’t in our character class.

test('german allowed',    () => { expect(v.isTitle("Ich weiß nichts")).toBe(true) });

What to do? The Unicode standard has categories for things like math symbols, emojis, and letters (among others). Javascript regex allows us to actually make use of this! Use a Unicode property escape in your JS regex to allow \w AND letters, i.e. \p{Letter}. Note that \p{...} is only recognized when the regex has the u flag.
Tests pass!
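
For reference, here is what a Unicode property escape looks like on its own. The lettersOnly pattern is a simplified assumption for illustration; your title regex will need more than letters.

```javascript
// The u flag is required for \p{...} to be treated as a property escape.
const lettersOnly = /^\p{Letter}+$/u;

console.log(lettersOnly.test("weiß"));      // true
console.log(lettersOnly.test("你好"));       // true
console.log(lettersOnly.test("abc1"));      // false, digits are not letters
console.log(lettersOnly.test("\u{1F4A9}")); // false, emoji are symbols
```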
Let’s talk diacritics. A diacritic is a special character that modifies the look of a letter. For example, the ñ in Spanish can be represented by two characters: "\u006E" (or n), and then the combining diacritic "\u0303" (or ~). In many cases, Unicode has a separate entry for common diacritical combinations, so ñ could actually be "\u006E\u0303" OR the singular "\u00F1". If someone is going to copy-paste a title with ñ, it’s impossible to know which one they copied! What a mess. Let’s test for exactly this:

test('ñ composed',        () => { expect(v.isTitle("ma\u00F1ana")).toBe(true) });
test('ñ decomposed',      () => { expect(v.isTitle("ma\u006E\u0303ana")).toBe(true) });

For me, the second test failed. So how do we handle this? Javascript’s normalize method is for this exact problem. The Unicode standard has hints within it that allow for quick conversions, and normalize will do the conversion to a normal “composed” form or “decomposed” form. Take a look at this method’s documentation. Then, update your code so that, before checking your string against your title regex, you normalize your raw string to the composed form.
Tests pass!
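
To see the two spellings of ñ side by side, here is a short sketch using the built-in normalize method with the standard form names "NFC" (composed) and "NFD" (decomposed):

```javascript
const composed = "ma\u00F1ana";         // ñ as one code point
const decomposed = "ma\u006E\u0303ana"; // n followed by a combining tilde

console.log(composed === decomposed);            // false
console.log(composed.length, decomposed.length); // 6 7

// NFC composes where possible; NFD decomposes.
console.log(decomposed.normalize("NFC") === composed); // true
console.log(composed.normalize("NFD") === decomposed); // true
```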
What about ligatures? A ligature is a single glyph that combines two or more characters. In English you might see æ or ﬀ. Some scripts, such as Devanagari, make extensive use of ligatures, where most letters in a single word combine together, giving each word a unique look. When it comes to Unicode, some ligatures, such as ﬀ, have a decomposed form that normalization can produce (for ﬀ, via the compatibility forms NFKC and NFKD). Others, such as æ, are ones you will need to handle separately. We’ll test a few ligatures in Part 3.
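
A short sketch of how the built-in normalize method treats these two ligatures. The variable names are for illustration only.

```javascript
const ff = "\uFB00"; // the single-character ligature ﬀ
const ae = "\u00E6"; // æ

console.log(ff.normalize("NFC") === ff);    // true, canonical forms keep the ligature
console.log(ff.normalize("NFKD") === "ff"); // true, the compatibility form splits it
console.log(ae.normalize("NFKD") === ae);   // true, æ has no decomposition at all
```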
What about bidirectional script? Some languages, such as Hebrew or Arabic, go right-to-left instead of left-to-right. Ordinarily, a system configured to that locale would simply reverse the displays everywhere. But what if you want to embed a small right-to-left string in a left-to-right environment? Let’s add this test case for now, and we’ll revisit this in the next section. Add this test case to make sure Arabic is allowed:

test('arabic allowed',     () => { expect(v.isTitle("مرحبا بالعالم")).toBe(true) });

Try this. Move your text cursor to the middle of the string - notice how it changes direction! Interesting, right?

Take a moment to read about the 2021 vulnerability called Trojan Source. (We’re only asking you read the front page, but the academic paper is very interesting and well-written if you want to learn more!)
Tag your work as part2 for any intermediate grading once you have all of parts 0, 1, and 2 working. Make sure you push to GitLab.

Part 3. Cleaning Input

Have you ever seen Google say “nope, invalid input”? It doesn’t! It takes in anything and does its best with the input. So you can’t just say “don’t allow weird characters” and call it a day. Instead, you need to clean the input by taking what you can and working with it as best you can.

For this part of the project, let’s make a new method: isSameTitle. Here is the stub for it:

/*
  Are the two titles *effectively* the same when searching?

  This function will be used as part of a search feature, so it should be
  flexible when dealing with diacritics and ligatures.

  Input: two raw strings
  Output: true if they are "similar enough" to each other

  We define two strings as "similar enough" as follows:

    * ignore leading and trailing whitespace
    * same sequence of "letters", ignoring diacritics and ligatures, that is:
      anything that is NOT a letter in the Unicode decomposed form is removed
    * Ligature "\u00E6" or æ is equivalent to "ae"
    * German character "\u1E9E" or ẞ is equivalent to "ss"
*/
function isSameTitle(strA, strB){

}

Next, copy and paste these test cases. That last one is an example of Zalgo text, an obscure internet meme.

describe('isSameTitle', () => {

  test('simple same',         () => { expect(v.isSameTitle('a', 'a')).toBe(true) });
  test('different object',    () => { expect(v.isSameTitle(new String('a'), new String('a'))).toBe(true) });
  test('not strings',         () => { expect(v.isSameTitle(1, null)).toBe(false) });
  test('leading trailing ws', () => { expect(v.isSameTitle(' a ', 'a')).toBe(true) });
  test('leading trailing ws', () => { expect(v.isSameTitle(" a \t", "a")).toBe(true) });

  test('hindi',               () => { expect(v.isSameTitle("नमस्ते दुनिया!", "नमस्ते दुनिया!")).toBe(true) });
  test('hindi different',     () => { expect(v.isSameTitle("नमस्ते दुनिया!", "अलविदा")).toBe(false) });

  test('mandarin',            () => { expect(v.isSameTitle("你好!", "你好!")).toBe(true) });
  test('mandarin different',  () => { expect(v.isSameTitle("你好!", "再见")).toBe(false) });

  test('multiple diacritics', () => { expect(v.isSameTitle("a\u0321\u031a", "a\u031a\u0321")).toBe(true) });

  test('mañana NFD vs NFC',   () => { expect(v.isSameTitle("ma\u00F1na", "ma\u006E\u0303na")).toBe(true) });

  test('ñ and n compat',      () => { expect(v.isSameTitle("ma\u00F1ana", "manana")).toBe(true) });
  test('ñ and n compat',      () => { expect(v.isSameTitle("ma\u006E\u0303ana", "manana")).toBe(true) });

  test('ligature ff and ﬀ',   () => { expect(v.isSameTitle("ff", "\uFB00")).toBe(true) });
  test('ligature ae and æ',   () => { expect(v.isSameTitle("ae", "\u00E6")).toBe(true) });
  test('german ẞ and ss',     () => { expect(v.isSameTitle("ss", "\u1E9E")).toBe(true) });

  test('bidi compat',   () => { expect(v.isSameTitle("abc\u202Edef", "abcdef")).toBe(true) });




  test('zalgo',   () => { expect(v.isSameTitle("zalgo", "z̸̢̡̨̢̢̨̧̨̨̧̧̧̢̡̢̢̧̨̨̨̢̧̡̢̛̛̛̛̘͎̫̥͙̙̫͈̯̱͍̪͇̻̥̟̥̮̞͈̟̮̼̙̮͈̫͍̠̟̖̱̬̝̩̲̪͔̝̪̥͕̬̺̠̝̖̥͈̲̱̪̣͚̫̩̞̼̠͔̲͉͉̳͉̰͎̖̠͕̩̟͉̲̣̥̬͖͚̫̲̣̟̱̜̰͉̥͎̱̰͉̫͉̳̯͖͓̣͖̖̤͙̙̹͍̪̬̱̭̤̩̠̝͖̞͙̳̠̗̳͈͚̭͖̩̯̪̼͙̮͇̟̘̹̗̜͓͔̬̫͕̖͙̖̩̹̺͎̮͙̗͇̦͕̞̞̪̩̙̞̥͇͓̼̹̭̟̭̻̬͈͍̥͚̖̯̟͔̹̮̫̳̘̪̗̱̣̟̖̯͉̞̱̗̤̟͓͓̥̥͈͈̯̖͕̝͔͚̺͉̞̫̰̥̮͔̣̝̞̬͔̼̞̯͇̖̪̘͕̪̠̀̓̔̈́͑̀̄̿̎̉́̏͑̀̄͛̿̾̈́͊͐͐͗̾̍́̄̅́̒̈́̆̀̾̌͌́̈́́̄̀͒͑́̾̃͊̃͛̍̓̒̾͆̏̈́̾͂̌̊͆̊̇̆͗͛̓͑͐͌̈́̌̓̓̇̓̅͌̃̄̀͐̃̓͐̉̐͊͆̓̈͗̈̎̽̉͌́̿́͗̃̈́͒̃͛̿̆̅̅͐͆́̆̀́̎́͐̽̐̈́̀̀͛̽͋̈̏͗̎̑͑̈́͑̾͒̀̚̚̚͘͘̕͘̕͘͜͜͜͜͜͜͜͝͠͝͝͠͠͝͝͝͝͠͝͠͝͝͝͝͠ã̵̧̢̨̧̢̨̨̨̨̛̛̛̛̛̛̝̞̟͕̮̱̼͕͖͚̭͓̲̹͇̼̦̟̠̭̖̤͙͉͇̣̮͓͔͖͕̙̤̗͇̩͈̙͈͎̭̣̼͇̙̼̬͓͖̗͙̪̟̪͚̙̗̜͎͙̞̘͖̗̦͙͎̻̖͉͔̣̩̹̟͈͙͎̲͚͉͕̃̏͂̃͌͑̆̅̎̃̒̈̓͂̃̊͑͆̏̉̋̔͊͋͛̎̂́̎͂̒̋̂̃͛̓̈́̆̾̓̈́̾̎́̄̿̈́͌̈́̓̍̈́̌̍͗̂̀̏́̍̐̉̏̊̆͑̊̄̅́͆̈́͊̈́͛͆́̽̅̈́̈̂̌̍̔̔̌̋̑̈́̓́̋̑́́̏̈́̾̑̽̔̔́̈́̍͛̿̆̌̋̃̌̂͌̀̏͒̓̈́̉̎́̒́̀̀̔́̉̋̀̀̽̈́̿̓̀̒͂̾̐̇̓̈́͆͆̀͆̅͒̌̂́͂̓̍̏͐̃̒̀̂̿͗̍̈́͌̇̇̑̇͋̿̔̑͂̅̓̀̊̊͐̽́́̓̀̐̉͗̀͗̔̀̍̉̉͑̋̎̃͋̏̉́̄͗͑̑̉͋̽͒͂̈͑́̎̄̍̾̈́͒͂̔̕̕̚̕͘͘̕̚̕̚̚͜͜͜͜͜͝͝͝͠͝͝͝͝͝͝͝͝͠͠͠͝͝͠͝͝͝͝͠͝͠͝ͅl̴̢̧̡̨̢̨̧̨̨̡̨̧̡̨̢̨̧̨̡̛̛̞̺̞̘̣͍͕͕̗̞̞̼̮̻̰͔̺̘͉͖͚̫̞̯͈͉̣̲̘͎̼̱̺̞̮̘̹͙̬̪͓̝̭͖̳̱͖͈͚̯͔̹̩̳̩͍̣̹͔̹̺̭̖̜͙̻̰̺̝̦̟̯̪̞͉̝̩̩̮̜̫̼̗͙͖͚̲͈͙̱̰̥̠͎̬̮͓̬͔̪͕̯͍͙̼͙͎̣̖̥̪͇͍͕͎̥̫̙͔̖̮̬͔̟͈̯͙̺̠͔̦̱̩̱̝͖̺̳̜̪̳͓̮͔͉̰̻̬̖͚͕̪̼̙͇̼̬͚̳͎̺̼̠̜̩̟̩̘̳̱̝̫̲̖̙͉͕͇̝͚̺̫̜̜̣̳̺͇͍̬̙̼̗̲͕̜̘͚̤̥̺͎͐͒̆̉̏̓̋̏̀́͑́̌́͑͂̎̃̈͛͐́̀̂͒̐̍̀̈́̒̓̊͒̈́̈́̊̍͊̿̾̊̾̎̋̓̇̃͐͆̔͑̓͗̏̈́̆͌͂̊̑͗̀̔̍̉͗̎̊͗̈́̽̉͆̒̓̾̈̽̑́̂̒̌̀̈́͗̏̎̋̍̐̓̈͗̆̆́̃͐̅͊̈͋͐͊̀̃͑͑́̈̐̄͗̈̓̿̇̉̈́̏̀̌̓́̈͐̅͐̃̽͊̍̈̉̆̈͋͐̐̀̈́̉̃̔͆́͆́̎̀͊̌̄̎̓͋̈́͐̄̽̕͘̕̚̚̚̕̚͘̕͘͘̕͘̕̚͘͘̚͜͜͜͜͝͝͝͠͠͠͝ͅͅͅͅͅͅg̶̢̡̡̢̧̧̧̨̨̡̢̧̡̨̨̢̧̨̢̨̧̧̨̢̡̢̢̨̛̛̛̰͔̩̠̲̬̗͉̥͓͚̟̮̣̠̞̪̞̗̘̥̙̥͖͕̘̬̖̩̘̰̤̫̗̲̬̘̠̠͓̘̖̯͉̦̝̣̺͎̥̟̻̺̱̝͙͍̙͚͓̦̦̩̪̥̜͎̦̘̝̖͔͔̙̠̖̮̪̼͔͈͖͎͎̳͈͎̗̹̪̫͕̦̩̬̤͙̙͇͙̱̫̭͖̤͚̠̖̮̭̞͖̫̯͖̰̮̟͎̟̠͉̙̞̣̟̺̲͎̹̲͉̜̝͖͎̻̞̣̮͚͓͍̲͓̣̗̱͉̗͓̬͎̹͈̣̝͙̝̙̮̦͓̭̯͓̦̻͇̤̣̥̘̠͈͈͕̬̘͕͙̙̼̣̹̮̞͚̦̬̟͖͓̞̳͚̗̠̩̰͍̤̩̙̞͉̼̯̹̫̤͐͆͗̍̓̈́͊̋̈́͊͒͛̈́̓̇͐͆̄̀̑͒̂̓̃̿́͒̈́̋͐̈́̄̒͌͐̿̎͋̌͆͛͒̆͛̔̂̈̈́̍̿̑̃̽͊́̂͆͌͑̈́̇́̉̄̉͘͘͘͘̕͘̕̚͜͜͜͝͠͠͠͠͝͠͝͝͝ͅͅͅơ̸̧̢̧̡̨̨̢̡̨̢̢̢̧͔̦̭̘̱̳̳̹̠̲̦͍͎̦͚̠͍̥͚͇̠̬̗̳̙̪̦̞̬̮̖͚̭͕͇͚͙͉̩͙̳͖͔͉̱̮̱̤͈̫̫͔̲͈̥̰̲̭͕̼͕̬̮̜͈̳͈͕̻̦̙͔͕̱̰̥̖̩̮͉͉̗̮̩͇̱͔̘̩̠̏̄͋̅̂̔͐̈́̾̏̿̈̑̊̒̽̔̕̕͘͜͜͜ͅͅͅ")).toBe(true) });





});

At this stage, you have everything you need to make these tests pass. Incrementally make the tests pass by handling both inputs and returning a boolean. This might take some time! But when you’re done, you’ll have a robust isSameTitle method that will react to raw inputs without having to sanitize them.

While the localeCompare() method may be useful here, we ask that you work with regular expressions instead.
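
As a starting point for the diacritics rule in the stub comment, one possible approach (an assumption, not the required solution) is to decompose the string and then drop everything that is not a letter. The helper name stripToLetters is hypothetical, and ligatures such as æ still need separate handling.

```javascript
// Hypothetical helper: one way to reduce a string to its base letters.
function stripToLetters(str) {
  return str
    .normalize("NFD")                 // split base letters from combining marks
    .replace(/[^\p{Letter}]/gu, "");  // remove anything that is not a letter
}

console.log(stripToLetters("ma\u00F1ana"));  // "manana"
console.log(stripToLetters("abc\u202Edef")); // "abcdef"
console.log(stripToLetters(" a \t"));        // "a"
```

Note that this also removes interior spaces, digits, and punctuation, which matches the rule as written in the comment but may be more aggressive than you want elsewhere.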

The above is an example of cleaning input for a specific purpose. Now let’s talk about cleaning output. The most famous place where we need to clean outputs is Cross-Site Scripting (XSS), where we want to make sure that the markup from our output will never actually be interpreted as markup or scripting.
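
To give a feel for what “never interpreted as markup” means, here is a minimal sketch of escaping the characters that are special in HTML. The escapeHtml helper is hypothetical and hand-rolled for illustration; it is not the sanitize-html API, and in real projects a well-tested library is usually the safer choice.

```javascript
// Hypothetical helper for illustration; not the sanitize-html API.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")  // must come first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

console.log(escapeHtml('<script>alert("hi")</script>'));
// &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```

A browser will display the escaped result as text rather than running it as a script.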

Part 4. Parsing with Regex Capture Groups

For this exercise, we need to do some parsing from a structured string. Earlier, we discussed some simple regular expressions and using test. Now, let’s use another feature of regex: capture groups.

Update your countPages method stub to this:

/*
  Page range string.

  Count, inclusively, the number of pages mentioned in the string.

  This is modeled after the string you can use to specify page ranges in
  books, or in a print dialog.

  Example page ranges, copied from our test cases:
    1          ===> 1 page
    p3         ===> 1 page
    1-2        ===> 2 pages
    10-100     ===> 91 pages
    1-3,5-6,9  ===> 6 pages
    1-3,5-6,p9 ===> 6 pages

  A range that goes DOWN still counts, but is never negative.

  Whitespace is allowed anywhere in the string with no effect.

  If the string is over 1000 characters, return undefined.
  If parsing the string results in NaN, return undefined.
  If the string does not properly fit the format, return 0.

*/
function countPages(rawStr){
}

/*
  Perform a best-effort cleansing of the page number.
  Given: a raw string
  Returns: an integer. Leading and trailing whitespace is ignored, and the number may have a p in front of it.
*/
function cleanPageNum(str){
}

In this example, we are going to parse a string with a non-trivial structure to it. The format of this string resembles a list of pages you might see in a citation, or in the print dialog when specifying pages. Breaking this down, we will need to:

Convert strings to integers
Handle singular numbers
Handle ranges of numbers
Allow the letter “p” in front of a number
Handle a comma-separated list of either singular numbers or ranges
Add up those counts into a total
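
The first item, converting strings to integers, has a few JS surprises worth seeing before you start. The sketch below only compares the built-in conversions on some made-up sample inputs (the samples list is an assumption for illustration); it is not an implementation of cleanPageNum.

```javascript
// Compare JavaScript's built-in string-to-number conversions.
// The sample inputs are illustrative only.
const samples = ['42', '09', '1e7', '12abc', '', ' 7 '];

const results = samples.map((s) => ({
  input: s,
  viaParseInt: parseInt(s, 10), // parses a leading integer, ignores whatever follows
  viaNumber: Number(s),         // converts the whole trimmed string; '' becomes 0
  digitsOnly: /^\d+$/.test(s),  // strict check: one or more digits, nothing else
}));

for (const r of results) {
  console.log(r);
}

// Integers above Number.MAX_SAFE_INTEGER can no longer be represented exactly.
const tooBig = Number.MAX_SAFE_INTEGER + 2;
console.log(Number.isSafeInteger(tooBig)); // false
```

Notice that parseInt and Number disagree on inputs like 1e7 and 12abc, and that neither rejects everything you might want rejected. The cleanPageNum tests later in this part probe these kinds of edge cases.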

To use capture groups, we need to add parentheses to our regular expression. Anything inside the parentheses is captured as a group; anything outside must still match but is not captured. Add this test case to your tests, and it should fail.

describe('js capture groups', () => {
  test('p11 matches the p but gets the 11 only', () => {
    const input = "p11";
    let answer = "change me to a regex capture group";
    expect(answer).toBe("11") });
});

To get this to pass, we want to enforce that the string starts with p, followed by a bunch of digits. But when it returns we want just the digits. To do this, we need a regular expression like this: /^p(\d+)$/.

In this case the capture group is (\d+) or a sequence of one or more digits.
To actually get the string captured by (\d+), you use the match method on the string. It returns an array (or null if nothing matched): the first element is the entire match, and each capture group follows in order.
A more readable way to do this is with named captures, which let us name the group in the regex itself. This adds some self-documentation to regexes, which are notoriously hard to read. Instead of (...) you embed the name with (?<yourname>...).
So to capture the 11 in p11, this line would do it: 'p11'.match(/p(?<page>\d+)/).groups.page
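
To make the mechanics concrete, here is a small sketch of numbered and named captures on a different string. The ch7 input and the chapter group name are invented for this illustration.

```javascript
// Numbered capture: index 0 is the whole match, index 1 is the first group.
const numbered = 'ch7'.match(/^ch(\d+)$/);
console.log(numbered[0]); // "ch7"
console.log(numbered[1]); // "7"

// Named capture: the same group, reachable by name through .groups.
const named = 'ch7'.match(/^ch(?<chapter>\d+)$/);
console.log(named.groups.chapter); // "7"

// When nothing matches, match() returns null, so guard before reading groups.
const miss = 'chapter seven'.match(/^ch(?<chapter>\d+)$/);
console.log(miss === null ? 'no match' : miss.groups.chapter); // "no match"
```

Note that a capture is always a string, so converting it to a number is still up to you.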

Get the above test to pass by adapting my example with named capture groups.
Now it’s time we get back to countPages and add test cases for that. Here they are:

describe('countPages', () => {
  test('single number',      () => { expect(v.countPages('2')).toBe(1) });
  test('single pNumber',     () => { expect(v.countPages('p3')).toBe(1) });
  test('simple expressions', () => { expect(v.countPages('1,3')).toBe(2) });
  test('range',              () => { expect(v.countPages('2-4')).toBe(3) });
  test('range w/pNum',       () => { expect(v.countPages('2-p4')).toBe(3) });
  test('range w/ws',         () => { expect(v.countPages(' 2 -  4')).toBe(3) });
  test('big range',          () => { expect(v.countPages('10-100')).toBe(91) });
  test('negative range',     () => { expect(v.countPages('100-10')).toBe(91) });
  test('multi range',        () => { expect(v.countPages('1-3,5-6,p9')).toBe(6) });
  test('neg multi range',    () => { expect(v.countPages('9-5,1-4')).toBe(9) });
  test('weird range split',  () => { expect(v.countPages('1-3-5')).toBe(0) });
  test('garbage',            () => { expect(v.countPages('asdcmiuf')).toBe(0) });
  test('unicode weirdness',  () => { expect(v.countPages("\0x00\xa2")).toBe(0) });
  test('integer overflow',   () => {
    const overflow = `p${Number.MAX_SAFE_INTEGER}-p0`
    expect(v.countPages(overflow)).toBe(undefined)
  });
});

describe('cleanPageNum', () => {
  test('single number',        () => { expect(v.cleanPageNum('2')).toBe(2) });
  test('single pNumber',       () => { expect(v.cleanPageNum('p3')).toBe(3) });
  test('whitespace',           () => { expect(v.cleanPageNum(' p4\t \r\n')).toBe(4) });
  test('two pNums',            () => { expect(v.cleanPageNum('p3p4')).toBe(undefined) });
  test('exponents  undefined', () => { expect(v.cleanPageNum('1e7')).toBe(undefined) });
  test('nothing usable',       () => { expect(v.cleanPageNum('abc')).toBe(undefined) });
  test('js max number',        () => { expect(v.cleanPageNum('abc')).toBe(undefined) });
  test('negatives undefined',  () => { expect(v.cleanPageNum('-19')).toBe(undefined) });
  test('leading zero octal?',  () => { expect(v.cleanPageNum('p09')).toBe(9) });
});

Make the tests pass. This is a tough one! It might take you as much time as it took to get here! Take it one step at a time. Here are some hints:

My implementation is split into a couple of utility methods, like cleanPageNum and removeWhitespace. I’m including my cleanPageNum tests.
My implementation used the split function instead of capture groups in one particular place, but you could do either one.
You probably don’t want to do this all with one regex. You might be able to, but you have to think about maintainability and your sanity.
The order of tests from top to bottom is the order that I wrote them in. Start simple!!
JS weirdness: If you define a regular expression with the g (global) flag, declare it inside your function, not outside as a constant. A global regex remembers where its last match ended (its lastIndex), so the same RegExp object is modified by each call and can give different answers on subsequent calls to your methods.
Another JS weirdness: if you do a Javascript for-loop over a list of items, make sure you use of instead of in. For example for(const s of listOfStrings). The in keyword gives you the index, not the value.
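
Both of these JS quirks are easier to believe once you see them. The sketch below uses made-up names (sharedDigits, hasDigit, pages) purely for illustration.

```javascript
// A regex with the g flag remembers where its last match ended (lastIndex).
const sharedDigits = /\d/g; // declared once and reused: stateful

const firstCall = sharedDigits.test('a1');  // true, lastIndex is now 2
const secondCall = sharedDigits.test('a1'); // false: the search resumed at index 2
console.log(firstCall, secondCall);

// Creating the regex inside the function gives every call a fresh lastIndex.
function hasDigit(str) {
  const digits = /\d/g;
  return digits.test(str);
}
console.log(hasDigit('a1'), hasDigit('a1')); // true true

// for...in walks the keys (as strings); for...of walks the values.
const pages = ['p1', 'p2'];
const keys = [];
const values = [];
for (const k in pages) keys.push(k);
for (const v of pages) values.push(v);
console.log(keys);   // [ '0', '1' ]
console.log(values); // [ 'p1', 'p2' ]
```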

As you work, make sure you push your code to your repository and that it runs on the CI server. We will grade this output primarily and run locally if there are any issues. You are done if you can get all of these to pass on the CI.

Part 5. Sanitizing Output for Web

Thus far we’ve covered the following input handling techniques:

Input validation via block lists
Input validation via allow lists
Input cleaning for a specific purpose
Input parsing via regex

Now let’s cover one more piece: output cleansing. Why, after all of this, do we need more handling? Haven’t we covered every possible case by now?

The answer is: maybe, but probably not.

But in the interest of the principles “defense in depth” and “frameworks are optional”, let’s make another assumption: what if someone forgot to use our input handling system? I like to think of this as the “intern problem” - someone who is new to the team/system/career/company/code/language/etc will not know that this exists. (And no matter who they are, we’ll “blame it on the intern”.) They don’t know what they don’t know. So it behooves us to add a final layer of mitigation: output sanitization.

In this project, we’re going to use sanitization in the context of an HTML page. Let’s make the following assumptions:

We’re on the server side, so our job is to render a giant HTML+JS+CSS string that the browser will interpret and execute
Some of that string will be our HTML+JS+CSS
Some of that string will have originated from untrusted sources. We’ll call it dirty in our tests
We want the browser to interpret our string, but not execute dirty strings.

Fortunately, Jest has a really handy tool for this called jsdom. It gives us a document object that acts just like the one in the browser. Via the document, we can make changes to the Document Object Model, aka the DOM. For example, we can have the browser “parse” our strings with something like this:

const dirty=`<script>alert('XSS')</script>`
document.body.innerHTML = `
    <span id="myspan">
      ${dirty}
    </span>
  `
// should be 1 because our dirty string injected an evil DOM element
document.getElementById('myspan').childElementCount

A few JS notes:

Here we’re using the backtick version of JS strings, called Template Literals. This makes string concatenation easier to read, particularly with quote characters. Thus, the final concatenation of the HTML string is:

<span id="myspan">
  <script>alert('XSS')</script>
</span>

By setting document.body.innerHTML we’re asking jsdom, running inside Node.js, to parse our string into a DOM. We can then query that DOM using the ubiquitous document.getElementById() method.
Once we get the DOM element we want, we can use the standard DOM property childElementCount to see if the browser treated it the way we expected.

Ok, back to our tests.

In book-validator.js, update your cleanForHTML() method stub with this:

/*
  Given a string, return another string that is safe for embedding into HTML.
    * Use the sanitize-html library: https://www.npmjs.com/package/sanitize-html
    * Configure it to *only* allow <b> tags and <i> tags
      (Read the README to learn how to do this)
*/
function cleanForHTML(dirty) {
  return dirty;
}

As mentioned in class, we don’t recommend doing your own XSS output sanitization. It’s just too complicated. Instead, we’ll use an external library: sanitize-html. Review the README to learn how it works.
In book-validator.js, add the line const sanitizeHTML = require('sanitize-html'); to the top to bring in the library. If you get a dom not found type of error, your magic comment is likely wrong (see comments in Setup).
Next, let’s talk test cases. After working on these test cases myself, I ended up refactoring my code to have a nice utility method. Copy this code into your book-validator.test.js, then review the comment.

/*
  Utility method for testing.
  Take dirty strings and inject them into a DOM string.
  Then, check to see if the dirty string *itself* changed the DOM at all.

  Input:
    - dirty: a string that we don't trust
    - n: the number of child elements we expect to get
*/
function expectDomChildren(dirty, n){
  document.body.innerHTML = `
      <span id="myspan">
        ${v.cleanForHTML(dirty)}
      </span>
    `
  expect(document.getElementById('myspan').childElementCount).toBe(n);
}

Ok, now let’s add a sanity test case to make sure things work. This will pass, even though our cleanForHTML does absolutely nothing.

describe ('cleanForHTML and DOM element XSS', () => {

  test('sanity check', () => { expectDomChildren(`Hello!`, 0) })

});

Next, let’s do a normal test case of what we want. A standard attempt at cross-site scripting is to add a <script> element. So let’s try that, and expect that there are NO children. Add this test just below our previous test in the same describe group:

test('<script> not allowed', () => { expectDomChildren(`<script></script>`, 0) })

Your test case should say that you expected no DOM elements, but we have 1 - that’s our <script> tag. Ok, now let’s put sanitize-html into action. Using their API, without any customization, sanitize the output.
Tests pass!
So what is sanitize-html doing? Presumably, it’s converting characters to their escaped forms. Add these tests:

test('<script> sanitized', () => {
  expect(v.cleanForHTML("<script></script>")).toBe("&lt;script&gt;&lt;/script&gt;")
})
test('heart issue', () => { expect(v.cleanForHTML("<3")).toBe("&lt;3") })

Tests fail! Turns out sanitize-html defaults to discard. Read up on how to configure sanitize-html to escape the output, not discard it.
Tests pass!
Now, sometimes we do want to allow users to have some harmless formatting. For us, we want <b> and <i> tags, but nothing else. Add these two test cases:

test('<b> and <i> allowed', () => {
  expectDomChildren(`<b>Bold!</b> and <i>Italics!</i>`, 2)
})
test('<a> not allowed', () => { expectDomChildren(`<a>Non-default!</a>`, 0) })

For me, the first test passed by default, but not the second. Configure sanitize-html for these tests.
Tests pass!
Now, as it turns out, there’s more than one way to do XSS! Sometimes you can gain control over the DOM without creating your own tags. Instead, you can add an attribute to an existing DOM element. It depends on where the string is being interpolated into the HTML string. For example, onload is an attribute that triggers an event. So <p onload="alert('HI!')"> would mean that when the browser loads the <p> element (i.e. on page load), that code will be executed. This is why simply escaping < and > is not enough - you need to actually know what your sanitizer is doing and not doing. In a separate describe block, add this test case:

describe ('cleanForHTML and DOM attribute XSS', () => {

  test('attribute exploit WORKS when dirty', ()=>{
    const dirty = `" onload="javascript:alert('hello!')" "`
    document.body.innerHTML = `
      <b id="mine" class="${dirty}">
      </b>
    `
    expect(document.getElementById('mine').attributes.id).toBeDefined();     // ok fine
    expect(document.getElementById('mine').attributes.class).toBeDefined();  // ok fine
    expect(document.getElementById('mine').attributes.onload).toBeDefined(); // uh-oh
  })
});

Sanity check. This test passed the first time for me, but it also doesn’t actually call our cleanForHTML method. Why write it then? Because whenever you do an automated security test, you need to make sure that the exploit would have worked if your mitigation wasn’t there. Otherwise you could be wasting your time. Thus, the above test case is a “test for our next test”.
Now let’s add our real security test. Add this below our previous test.

test('attribute exploit FAILS when cleaned', ()=>{
  const dirty = `" onload="alert('hello!')" "`
  document.body.innerHTML = `
    <b id="mine" class="${v.cleanForHTML(dirty)}">
    </b>
  `
  expect(document.getElementById('mine').attributes.id).toBeDefined();     // ok fine
  expect(document.getElementById('mine').attributes.class).toBeDefined();  // ok fine
  expect(document.getElementById('mine').attributes.onload).toBeUndefined(); // phew!
})

Uh oh. The onload attribute still got defined! As it turns out, sanitize-html didn’t think of this. (That’s ok, major social media companies made this exact mistake too.) The sanitize-html library assumes that you are giving it valid HTML to parse to start with. Our dirty string doesn’t really look like HTML. So how should we fix this? Well, we’ll have to escape the quote characters too. Make that change to cleanForHTML: replace all " characters with &quot;
Tests pass!
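
To see why the double quote matters, it helps to look at the raw HTML string before any parser touches it. The sketch below is not a full cleanForHTML: the helper name escapeDoubleQuotes is hypothetical, and it performs only the double-quote escaping step.

```javascript
// Hypothetical helper: performs only the double-quote escaping step.
function escapeDoubleQuotes(str) {
  return str.replace(/"/g, '&quot;');
}

const dirty = `" onload="alert('hello!')" "`;

// Unescaped: the first quote in dirty closes class="..." early,
// so onload is parsed as its own attribute.
const unsafe = `<b id="mine" class="${dirty}"></b>`;
console.log(unsafe);

// Escaped: every quote from dirty stays inside the class attribute value.
const safe = `<b id="mine" class="${escapeDoubleQuotes(dirty)}"></b>`;
console.log(safe);
```

In the unsafe string the class attribute ends up empty and onload stands alone; in the safe string the whole dirty payload is just a strange-looking class value.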
Don’t forget that HTML has single quotes! Add this test case.

test('attribute exploit FAILS when cleaned, single quote edition', ()=>{
  const dirty = `' onload='alert("hello!")' '`
  document.body.innerHTML = `
    <b id="mine" class='${v.cleanForHTML(dirty)}'>
    </b>
  `
  expect(document.getElementById('mine').attributes.id).toBeDefined();     // ok fine
  expect(document.getElementById('mine').attributes.class).toBeDefined();  // ok fine
  expect(document.getElementById('mine').attributes.onload).toBeUndefined(); // phew!
})

Tests fail! To fix this, you might need to review the OWASP cheat sheet again for escaping quotes. Update your cleanForHTML one last time to get the above test to pass.
Ok, but doesn’t HTML also allow quote-less constructions like <b class=foo>? Yep! Don’t do that. We won’t make you do a test for it, but you can see how the back-and-forth can go on forever.

Reality check. You can see how preventing XSS with sanitization-only is a blocklist-type approach. Sanitization requires us to know what is dirty in the first place, and that is entirely context-dependent. Our sanitization routines are only going to be good if we use them properly, and developers are infinitely creative with their mistakes. Thus, our original routine of simply checking isTitle is a very effective one.

Discussion: Input Handling at Scale

Some will say “every method should validate its inputs”. That sounds nice. And it certainly is important to think defensively on every method we implement. But, given how complex input handling is… is that possible?

Single Responsibility Principle (SRP) and Don’t Repeat Yourself (DRY) are two of the most useful software design philosophies I know of. So, if a method that, say, computes cosine should check its inputs AND do mathematical computations, isn’t that breaking SRP? And re-validating inputs can be both a performance hit and not DRY, right?

You can see why strong, static typing is such a popular philosophy in programming language design. Once you convert a messy, unstructured data structure like a string into a specific data structure, you have much less to worry about. So this notion of removing all assumptions about your inputs at compile time and getting down to business is at the core of strongly typed languages like Rust, Java, C#, or Ada. Unfortunately, language type systems can’t fix all input handling issues.
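
JavaScript has no static types, but the same idea can be approximated at runtime: convert the raw string once, at the boundary, and then pass around a structured value instead of the string. The Percent class and applyDiscount function below are hypothetical names made up for this sketch, not part of the project.

```javascript
// Hypothetical value type; by convention, instances are created only through parse().
class Percent {
  constructor(value) {
    this.value = value;
    Object.freeze(this); // no one can change the value after validation
  }

  // Convert a raw string once, at the boundary. Returns undefined if invalid.
  static parse(raw) {
    if (typeof raw !== 'string' || !/^\d{1,3}$/.test(raw.trim())) return undefined;
    const n = Number(raw.trim());
    return n <= 100 ? new Percent(n) : undefined;
  }
}

// Downstream code checks the type, not the string format.
function applyDiscount(price, percent) {
  if (!(percent instanceof Percent)) throw new TypeError('expected a Percent');
  return price * (1 - percent.value / 100);
}

const discount = Percent.parse(' 25 ');
console.log(applyDiscount(80, discount)); // 60
console.log(Percent.parse('250'));        // undefined
console.log(Percent.parse('2e1'));        // undefined
```

Here applyDiscount stays focused on arithmetic, and the string handling lives in exactly one place. A statically typed language would enforce the same separation at compile time rather than with an instanceof check.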

Another, now largely defunct, answer to this was aspect-oriented programming. Input validation was considered a “cross-cutting concern”, so you would implement an “aspect” that attaches to the beginnings and ends of various methods and does input validation. That sounds great until you realize that coding in an aspect-oriented manner means that when you call function foo(), you’re not actually calling foo() first, but jumping to another method that first checks the inputs and then calls foo() for you. And that’s if you’re lucky enough to have only one aspect in play. This non-linearity in code was largely confusing to developers, so (thankfully) aspect-oriented programming is mainly dead. But you can see how this paradigm would be well-equipped for input handling use cases.

Today, the main answer to input handling is frameworks. You will often see input handling tools built into the APIs you use. For example, an email address is a very complex string (did you know about plus tag subaddressing? …few people do), so many standard libraries and frameworks will maintain their own way of validating it. Developers only need to remember to use them and not just disable validation when it gets in the way. For larger or more custom systems, you will often see input handling be an entire subsystem in one place with a well-defined interface.

Submission

Push your code to your GitLab repository. Make sure that your unit tests all pass in the continuous integration system. Tag your submission as part2 for the initial submission, and final for the final submission so we know which versions to grade.

Our rubric is as follows:

(5pts) Submission instructions followed
(10pts) Maintainability of your solution
The following functions work as expected, as demonstrated by the tests passing:
- (5pts) isTitle
- (10pts) countPages
- (5pts) cleanPageNum
- (5pts) isSameTitle
- (10pts) cleanForHTML

Total: 50 points