Document Analysis Project in C

This project creates a linguistic analysis of a text document. It reads a text file in Unix/Linux format and provides analytical results such as the word frequency for every unique word in the document. NOTE -- all evaluation will be by calls to the instructor's set of unit tests.

You can create your own user interface if you want, but you must not alter the built-in unit test call at the beginning of main().

You may use any standard library functions provided by the compiler, such as those declared in string.h, ctype.h, and many others. You may not use code you find on the internet or elsewhere.

Definitions and Requirements

  • A word consists only of letters, except that possessive words such as Larry's may contain an apostrophe. The apostrophe is the only non-letter character allowed as part of a word.
  • Ignore differences in upper and lower case. "His" and "his" are the same word.
  • A sentence must have at least one word and must end with a period.
  • Count the number of occurrences of each unique word. Common words such as "the" will normally have a count much greater than 1.
  • Maintain a sorted list of all unique words. In most cases "a" will be the first word in that sorted list.
  • The get_first_word function will get the first word in alphabetic order from the list (typically "a") and its count.
  • The get_next_word and get_prev_word functions navigate through the list relative to the current position. This means you must track the current position in the list.
  • On FAILURE all get_????_word functions will return 0 in the word_count field of the word_entry struct.
  • Do not make any assumptions about file size or the number of unique words. Your program must work with files containing no words and with files containing many unique words.
  • DO NOT USE ARRAYS in your linked list! However, you can use an array as a temporary place to store a word as you are reading in a word. You can assume that no word is longer than 100 characters.
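
The word and sentence rules above can be sketched as a small scanner. This is only an illustration under stated assumptions, not the required design: the names next_word, count_sentences, and MAX_WORD_LEN are hypothetical, the scanner works on an in-memory string rather than a file, and it assumes an apostrophe counts as part of a word only when a letter comes before and after it (the rules above do not say how a leading or trailing apostrophe should be treated).

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_WORD_LEN 100  /* the assignment says no word is longer than this */

/* Copies the next word at or after *cursor into out, lowercased, and
 * advances *cursor past it. Returns the word length, or 0 if no word is left.
 * Assumption: an apostrophe is kept only when it sits between two letters. */
size_t next_word(const char **cursor, char out[MAX_WORD_LEN + 1])
{
    const char *p = *cursor;
    size_t len = 0;

    /* Skip characters that cannot start a word. */
    while (*p != '\0' && !isalpha((unsigned char)*p))
        p++;

    while (*p != '\0') {
        unsigned char c = (unsigned char)*p;
        if (isalpha(c)) {
            if (len < MAX_WORD_LEN)          /* extra letters are dropped */
                out[len++] = (char)tolower(c);
        } else if (c == '\'' && isalpha((unsigned char)p[1])) {
            if (len < MAX_WORD_LEN)
                out[len++] = '\'';
        } else {
            break;                           /* end of the word */
        }
        p++;
    }
    out[len] = '\0';
    *cursor = p;
    return len;
}

/* Counts periods that close a run of text containing at least one letter. */
int count_sentences(const char *text)
{
    int count = 0;
    int has_word = 0;

    for (const char *p = text; *p != '\0'; p++) {
        if (isalpha((unsigned char)*p)) {
            has_word = 1;
        } else if (*p == '.' && has_word) {
            count++;
            has_word = 0;                    /* next sentence needs its own word */
        }
    }
    return count;
}
```

Calling next_word repeatedly on "His dog's bone." yields "his", "dog's", and "bone", then returns 0. The fixed-size buffer is used only as the temporary holding place for one word, which the array rule above permits.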

Overall Evaluation

As you write these functions, you must also create unit tests in the test function in unit_tests.c. To receive full credit for each function, your unit tests must cover both the boundary conditions and the normal operation of the function. Your unit tests will be graded by visual inspection combined with how well your functions pass (or fail) the instructor's unit tests.

Suggestions

  • Implement the functions in the order listed in the specific activities list below.
  • Start with a very simple text file.
  • See the updated main.c for OpenBSD memory checks.

Specific Activities and Point Allocations

  • (10) Activity Journal for all required functions
  • (10) A valid Makefile that compiles with debug and the -Wall switch for all source modules and creates the "test" executable.
  • (10) read_file function
  • (10) free_list function
  • (10) get_first_word function
  • (10) get_next_word function
  • (10) get_prev_word function
  • (10) get_last_word function
  • (10) get_sentence_count function
  • (10) get_unique_word_count function
  • (bonus +10) get_most_common_word_after_this_word function
  • (bonus +10) write_unique_word_list_to_csv_file function
  • (deductions of up to 10 points for warnings) There should be no warnings issued when your code compiles with the -Wall switch.
  • (deductions of up to 20 points for memory leaks and memory violations) Use valgrind to verify your code.

Setup

Download analysis.zip. This is just the bare framework.

Hints

  • Use a doubly linked list.
  • To implement the bonus you are allowed to add members (not change -- only add) to the word_entry struct.
  • You should have three global pointers into your linked list: the head, the tail, and the current position (the most recently accessed entry).
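
The hints above can be sketched as follows. This is a minimal illustration, not the framework's code: the real word_entry struct is defined in analysis.h and is not reproduced here, so demo_node, demo_entry, and every demo_ function name below are hypothetical stand-ins. The sketch also assumes that a failed move leaves the current position unchanged, which the requirements do not specify.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical list node; the word is heap-allocated, not a fixed array. */
struct demo_node {
    char *word;
    int word_count;
    struct demo_node *prev;
    struct demo_node *next;
};

/* Hypothetical stand-in for the value the get_..._word functions return. */
struct demo_entry {
    const char *unique_word;
    int word_count;            /* 0 signals failure */
};

static struct demo_node *head, *tail, *current;

/* Inserts word in alphabetical order, or bumps its count if already present.
 * Returns 1 on success and 0 if memory could not be allocated. */
int demo_add_word(const char *word)
{
    struct demo_node *at = head;
    while (at != NULL && strcmp(at->word, word) < 0)
        at = at->next;
    if (at != NULL && strcmp(at->word, word) == 0) {
        at->word_count++;
        return 1;
    }

    struct demo_node *n = malloc(sizeof *n);
    if (n == NULL)
        return 0;
    n->word = malloc(strlen(word) + 1);
    if (n->word == NULL) {
        free(n);
        return 0;
    }
    strcpy(n->word, word);
    n->word_count = 1;

    /* Link n just before at; at == NULL means append after the tail. */
    n->next = at;
    n->prev = (at != NULL) ? at->prev : tail;
    if (n->prev != NULL) n->prev->next = n; else head = n;
    if (at != NULL) at->prev = n; else tail = n;
    return 1;
}

/* Builds the result for node n and, on success, makes n the current position. */
static struct demo_entry demo_report(struct demo_node *n)
{
    struct demo_entry e = { NULL, 0 };
    if (n != NULL) {
        current = n;
        e.unique_word = n->word;
        e.word_count = n->word_count;
    }
    return e;
}

struct demo_entry demo_first(void) { return demo_report(head); }
struct demo_entry demo_last(void)  { return demo_report(tail); }
struct demo_entry demo_next(void)  { return demo_report(current ? current->next : NULL); }
struct demo_entry demo_prev(void)  { return demo_report(current ? current->prev : NULL); }

/* Releases every node and resets all three global pointers. */
void demo_free_all(void)
{
    while (head != NULL) {
        struct demo_node *next = head->next;
        free(head->word);
        free(head);
        head = next;
    }
    tail = current = NULL;
}
```

After adding "the", "a", "the", and "cat", the list order is "a", "cat", "the", with a count of 2 for "the". Stepping past the last entry returns a word_count of 0. Keeping the tail pointer makes the last-word lookup constant time, and the prev links are what make backward navigation possible without rescanning from the head.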

Activity Journal

Fill out an ActivityJournal.txt for each step in your project.

Submission

Place your completed analysis.c, analysis.h, unit_tests.c, Makefile, and ActivityJournal.txt files in a directory named analysis at the top level of your git repo.