Introduction to Ruby – histogram.rb

Setup

For this activity you will write two Ruby scripts, histogram1.rb and histogram2.rb, plus an optional third, histogram3.rb. Together they build up to a histogram showing the frequency of occurrence of words in a text file. Each script reads text from standard input and prints its results to standard output.

Two text files are provided for your use: totc.txt (the first paragraph from Charles Dickens's A Tale of Two Cities) and jabberwocky.txt (from Lewis Carroll's Through the Looking-Glass and What Alice Found There).

Place your scripts in a directory called RubyHistogram and submit via Git as directed by your instructor.

Activity Steps

Part 1 (histogram1.rb)

  1. Read the text file, line by line, from the standard input:
    $stdin.each do |line|
       # process this line
    end
  2. Apply the chomp! method to each line to remove the end-of-line characters, and use puts to print each line with upper case letters converted to lower case. See the downcase (and downcase!) methods of the String class.
  3. Enhance step #2 by removing any characters other than letters and spaces. See the gsub (and gsub!) methods of the String class, along with the format of regular expressions in the Regexp class.
    /[^a-zA-Z\s]/ regular expression for "any characters other than lower case letters, upper case letters, and spaces"
  4. Enhance step #3 by using sub (or sub!) from String to strip any leading spaces. 
    /^\s+/ regular expression for "one or more spaces at the beginning of the line"
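
The four steps of Part 1 can be sketched on a single line of text as follows. This is only an illustration under stated assumptions: the helper name clean_line and the sample string are made up for the example and are not part of the assignment, and the non-destructive methods (chomp, downcase, gsub, sub) are used here in place of their ! variants.

```ruby
# Hypothetical helper: clean_line is an illustrative name, not a required one.
def clean_line(line)
  text = line.chomp                    # remove the end-of-line characters
  text = text.downcase                 # convert upper case to lower case
  text = text.gsub(/[^a-zA-Z\s]/, "")  # drop everything except letters and whitespace
  text = text.sub(/^\s+/, "")          # strip leading whitespace
  text
end

# Made-up sample input standing in for one line read from $stdin.
sample = "  It was the BEST of times, it was the worst of times,\n"
puts clean_line(sample)
```

In the script itself, the same calls would go inside the $stdin.each block from step 1.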

Part 2 (histogram2.rb)

  1. Now change from processing lines to processing the words in each line:
    1. Split each line into an array of words, treating any run of one or more spaces as a single delimiter. See the split method in String, and use split(/ +/).
      / +/ regular expression for "words are delimited by one or more spaces". Note that there is a space character before the '+'.
    2. Using the each method from Array and an appropriate block, print each word on a separate line.
  2. Use a Hash named bag to simulate a bag of strings by mapping each unique string to the count of its occurrences.
    1. Create the bag with Hash.new(0) so that the default value for a string is 0.
    2. Change the body of the each block from step 1.2 above to simply increment the count in bag for each word, using the word as the hash key.
    3. After all words on all lines are accounted for, use the each method from Hash to print a list of words and their counts, one word & count per line.
      Note: the each method of Hash provides two arguments to the associated block: a key and its value. The keys, of course, are words, and the values are the counts. In Ruby 1.9 and later the pairs are generated in insertion order, that is, the order in which each word was first seen; do not expect them to be sorted.
  3. Use select from Hash to keep only the words having at least two occurrences. In Ruby 1.9 and later, select on a Hash returns a Hash (in Ruby 1.8 it returned an Array), so call to_a on the result to get an Array of key/value pairs.
    Print the resulting words and counts using each on the array. Note that the values passed to each are themselves arrays: two-element arrays where the element at index 0 is the key and the element at index 1 is the value.
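
As a rough sketch of the bag idea in Part 2, the fragment below counts words from a couple of already-cleaned lines and then keeps the words seen at least twice. The helper name count_words and the sample lines are assumptions made for this example; your script would feed it lines read from $stdin instead.

```ruby
# Hypothetical helper: count_words is an illustrative name only.
def count_words(lines)
  bag = Hash.new(0)                 # unseen words start with a count of 0
  lines.each do |line|
    line.split(/ +/).each do |word|
      bag[word] += 1                # the word itself is the hash key
    end
  end
  bag
end

# Made-up, already lower-cased input lines.
lines = ["the cat and the hat", "the end"]
bag = count_words(lines)

# Hash#each yields a key and its value to the block.
bag.each { |word, count| puts "#{word} #{count}" }

# Keep words with at least two occurrences, then convert to an array of pairs.
frequent = bag.select { |word, count| count >= 2 }.to_a
frequent.each { |pair| puts "#{pair[0]} #{pair[1]}" }
```

Because the bag is created with Hash.new(0), the += 1 works even the first time a word is seen.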

Part 3 (histogram3.rb) - Optional

  1. Sort the array of pairs using the sort method and a block to do the comparison.
    1. Using the counts (the second element in each pair), arrange things so that the words & values are printed from highest to lowest number of occurrences.
    2. Within a given number of occurrences, sort on the words themselves in alphabetic order.
    3. You'll have to learn about Ruby's <=> operator to do the sort comparisons.
  2. To determine the length of the longest word, use the inject(0) method on the pairs array; the block returns the larger of the current maximum and the length of the current word.

  3. Output the histogram in the following format. 

    the        ********************
    and        ***************
    he         *******
    in         ******
    through    ****
    jabberwock ***
    my         ***
    all        **


    The formatting is tricky, so use this snippet, which relies on printf.
    printf is a built-in Kernel method, so no require is needed at the top of your source.
      
    pairs.each do | apair |
      printf "%-*.*s ", longest, longest, apair[0]
      puts "*" * apair[1]
    end

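    Here longest is the length of the longest word, as determined in step 2 above. The format %-*.*s left-justifies the word and takes both the field width and the maximum number of characters from the arguments, which is why longest is passed twice.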
  4. Make the cutoff for the minimum count an optional command line argument.
    1. If ARGV[0] exists (is not nil), convert it to an integer using the to_i method and use this as the minimum count for the words printed in the histogram.
    2. Otherwise, use the minimum of 2 as we've done so far.
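
The pieces of Part 3 might fit together along the following lines. Treat this as a sketch rather than the expected solution: the helper names sort_pairs, longest_word_length and minimum_count are invented for the example, the pairs array is made-up data, and ljust is used for the column layout as an alternative to the printf snippet above.

```ruby
# Hypothetical helpers; none of these names are required by the assignment.

# Sort by count, highest first; break ties alphabetically by word.
def sort_pairs(pairs)
  pairs.sort do |a, b|
    by_count = b[1] <=> a[1]                  # b before a gives descending order
    by_count.zero? ? (a[0] <=> b[0]) : by_count
  end
end

# Fold over the pairs, carrying the largest word length seen so far.
def longest_word_length(pairs)
  pairs.inject(0) { |max, pair| [max, pair[0].length].max }
end

# args stands in for ARGV; fall back to 2 when no argument is given.
def minimum_count(args)
  args[0].nil? ? 2 : args[0].to_i
end

# Made-up word/count pairs.
pairs = [["my", 3], ["the", 20], ["jabberwock", 3], ["all", 2]]
sorted = sort_pairs(pairs)
longest = longest_word_length(sorted)

sorted.each do |pair|
  puts "#{pair[0].ljust(longest)} #{'*' * pair[1]}"
end
```

In the script, minimum_count(ARGV) would supply the cutoff used in the select from Part 2.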

Submission

Create a directory called RubyHistogram and submit histogram1.rb, histogram2.rb and (if you wrote it) histogram3.rb via Git.

Resources

A short but useful introduction to Ruby is in The Little Book of Ruby.

The www.ruby-doc.org site is a treasure trove of material on Ruby (version 1.9.3 for us), including:

Check Books 24x7 - students in the past have told me that some of the Ruby books there have been helpful.