Introduction to Ruby – histogram.rb
Setup
For this activity you will write two Ruby scripts, plus an optional third,
named histogramN.rb (where N is 1, 2, or 3). Together they build up to a
script that produces a histogram showing how often each word occurs in a
text file. Each script reads text from standard input and prints its
results to standard output.
Two text files are provided for your use: totc.txt (the first paragraph
from Charles Dickens's A Tale of Two Cities) and jabberwocky.txt (from
Lewis Carroll's Through the Looking-Glass and What Alice Found There).
Place your scripts in a directory called RubyHistogram
and submit via Git as directed by your instructor.
Activity
Steps
Part 1 (histogram1.rb)
- Read the text file, line by line, from the standard input:
$stdin.each do |line|
  # process this line
end
- Apply the chomp! method to each line to remove the end-of-line
characters, and use puts to print out each line with upper case letters
converted to lower case. See the downcase (and downcase!) method of the
String class.
- Enhance step #2 by removing any characters other than letters and
spaces. See the gsub (and gsub!) methods of the String class, along with
the format of regular expressions in the Regexp class.
/[^a-zA-Z\s]/ is a regular expression for "any characters other than
lower case letters, upper case letters, and spaces" (\s matches any
whitespace character, not only the space itself).
- Enhance step #3 by using sub (or sub!) from String to strip any leading
spaces.
/^\s+/ is a regular expression for "one or more spaces at the beginning
of the line".
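The four steps of Part 1 can be sketched roughly as follows. This is one possible arrangement, not the required solution: the helper name clean_line is made up for illustration, and a StringIO object stands in for $stdin so the sketch runs without an input file. It uses the non-bang methods because chomp!, downcase!, gsub! and sub! return nil when they change nothing, which makes them unsafe to chain.

```ruby
require 'stringio'

# Hypothetical helper (name is an assumption): applies steps 2-4 to one line.
def clean_line(line)
  cleaned = line.chomp.downcase               # drop the line ending, fold to lower case
  cleaned = cleaned.gsub(/[^a-zA-Z\s]/, '')   # remove everything except letters and whitespace
  cleaned.sub(/^\s+/, '')                     # strip leading whitespace
end

# StringIO stands in for $stdin here; in the real script use $stdin.each.
sample = StringIO.new("  It was the BEST of times,\nit was the worst of times!\n")
sample.each do |line|
  puts clean_line(line)
end
```

With the real script, the same loop body would sit inside $stdin.each, and the script would be run as ruby histogram1.rb < totc.txt.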
Part 2 (histogram2.rb)
- Now change from processing lines to processing the words in each line:
- Split each line into an array of words on arbitrarily long whitespace
boundaries. See the split method in String, and use split(/ +/).
/ +/ is a regular expression for "words are delimited by one or more
spaces". Note that there is a space character before the '+'.
- Using
the each
method from Array
and an appropriate block, print each word on a separate line.
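A small sketch of the split-and-print step, assuming the line has already been cleaned as in Part 1. The helper name words_in is hypothetical. It also shows why stripping leading spaces first matters: split(/ +/) produces an empty first element when the line starts with a space.

```ruby
# Hypothetical helper (name is an assumption): split a cleaned line into words.
def words_in(line)
  line.split(/ +/)   # one or more spaces separate the words
end

words_in("twas brillig  and the   slithy toves").each do |word|
  puts word          # one word per line
end

# A leading space yields an empty string as the first "word".
p words_in(" twas brillig")
```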
- Use
a Hash
named bag to
simulate a bag of strings by mapping each unique string to the count of
its occurrences.
- Create
the bag
with Hash.new(0)
so that the default value for a string is 0.
- Change
the body of the each
block from 5(b) to simply increment the count in bag
for each word, using the word as the hash key.
- After
all words on all lines are accounted for, use the each
method from Hash
to print a list of words and their counts, one word & count per
line.
Note: the each method of Hash provides two arguments to the associated
block: a key and its value. The keys, of course, are words, and the values
are the counts. In Ruby 1.9 and later (including 2.4.3), the key/value
pairs are generated in insertion order, that is, the order in which each
word was first added to the hash.
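The bag idea might look something like this. The helper name count_words and the sample lines are assumptions made for illustration; the only pieces taken from the steps above are Hash.new(0), the increment keyed by word, and the two-argument each block.

```ruby
# Hypothetical helper (name is an assumption): count words over cleaned lines.
def count_words(lines)
  bag = Hash.new(0)                 # a missing key reads as 0
  lines.each do |line|
    line.split(/ +/).each do |word|
      bag[word] += 1                # the word itself is the hash key
    end
  end
  bag
end

bag = count_words(["the cat and the hat", "the end"])
bag.each do |word, count|           # Hash#each yields key and value
  puts "#{word} #{count}"
end
```

Because the default value is 0, bag[word] += 1 works even the first time a word is seen, so no explicit check for a missing key is needed.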
- Use select from Hash to keep only the words having at least two
occurrences. In Ruby 1.9 and later, select on a Hash returns a new Hash
rather than an Array; call to_a on the result to get an Array of key/value
pairs. Print the resulting words and counts using each on the array. Note
that the values passed to each are themselves arrays: two-element arrays
where the element at index 0 is the key and the element at index 1 is the
value.
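One way the filtering step could be written is shown below. The helper name frequent_pairs and the sample bag are hypothetical; the to_a call reflects that Hash#select returns a Hash in Ruby 1.9 and later.

```ruby
# Hypothetical helper (name is an assumption): pairs with at least min_count occurrences.
def frequent_pairs(bag, min_count = 2)
  bag.select { |_word, count| count >= min_count }.to_a   # Hash -> array of [word, count]
end

bag = { "the" => 3, "cat" => 1, "and" => 2 }
frequent_pairs(bag).each do |apair|
  puts "#{apair[0]} #{apair[1]}"   # apair[0] is the word, apair[1] is its count
end
```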
Part 3 (histogram3.rb) - Optional
- Sort
the array of pairs using the sort
method and a block to do the comparison.
- Using
the counts (the second element in each pair), arrange things so that
the words & values are printed from highest to lowest number of
occurrences.
- Within
a given number of occurrences, sort on the words themselves in
alphabetic order.
- You'll
have to learn about Ruby's <=>
operator to do the sort comparisons.
- To
determine the longest word, use the inject(0)
method on the pairs array - the block returns the larger of the current
maximum and the length of the current word.
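The sorting and longest-word steps could be sketched as below. Both helper names (sort_pairs, longest_word_length) are made up for this example. The comparison block relies on <=> returning -1, 0, or 1: swapping the operands gives descending counts, and a result of 0 falls through to an alphabetical comparison of the words.

```ruby
# Hypothetical helper (name is an assumption): highest count first, ties alphabetical.
def sort_pairs(pairs)
  pairs.sort do |a, b|
    by_count = b[1] <=> a[1]                    # b before a gives descending order
    by_count.zero? ? a[0] <=> b[0] : by_count   # tie: compare the words themselves
  end
end

# Hypothetical helper (name is an assumption): length of the longest word.
def longest_word_length(pairs)
  pairs.inject(0) { |max, pair| [max, pair[0].length].max }
end

pairs = [["my", 3], ["the", 20], ["jabberwock", 3], ["all", 2]]
p sort_pairs(pairs)
p longest_word_length(pairs)
```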
- Output the histogram in the following format.
the        ********************
and        ***************
he         *******
in         ******
through    ****
jabberwock ***
my         ***
all        **
The formatting is tricky, so use the following snippet. It uses printf,
which is part of Ruby's Kernel module, so adding require 'pp' (the pretty
print library) to the top of your source is harmless but not needed for
printf itself.
pairs.each do |apair|
  printf "%-*.*s ", longest, longest, apair[0]
  puts "*" * apair[1]
end
Here longest is the length of the longest word, found with inject as
described above; it sets the width of the word column.
- Make the cutoff
for the minimum count an optional command line argument.
- If
ARGV[0]
exists (is not nil),
convert it to an integer using the to_i
method and use this as the minimum count for the words printed in the
histogram.
- Otherwise,
use the minimum of 2 as we've done so far.
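The optional cutoff might be handled along these lines. The helper name min_count_from is hypothetical, and the default of 2 comes from the steps above. Note that to_i returns 0 for text that does not start with a number, so an argument like "abc" would print every word.

```ruby
# Hypothetical helper (name is an assumption): minimum count from the command line.
def min_count_from(args, default = 2)
  args[0].nil? ? default : args[0].to_i   # to_i gives 0 for non-numeric text
end

min_count = min_count_from(ARGV)
puts "minimum count: #{min_count}"
```

Since the text still arrives on standard input, an invocation would look like ruby histogram3.rb 3 < jabberwocky.txt.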
Submission
Create a directory called RubyHistogram
and submit histogram1.rb, histogram2.rb and histogram3.rb via
Git.
Resources
A short but useful introduction to Ruby is in The
Little Book of Ruby.
The www.ruby-doc.org
site is a treasure trove of material on Ruby (version 2.4.3 for us).
Check Books
24x7 - students in the past have told me that some of the
Ruby books there
have been helpful.