Homework #5
New York University
Computer Science Department
Data Structures, Fall 2008
Eugene Weinstein
Homework #4 Review
• Huffman coding is a variable-length binary encoding for text
• We implemented Huffman's optimal code finding algorithm (book 389-395)
  o Builds tree representing shortest possible code
• Input for HW#4: letters, frequencies:
  o A 20 E 24 ...
• Construct Huffman tree
• Navigate tree to find code:
  o c: 0, a: 10, b: 11
Homework #5 Overview
• Given a document
  o Calculate letter frequencies
  o Construct Huffman code
  o Encode document
  o Calculate memory savings of Huffman binary encoding vs 8-bit ASCII
  o Correctly decode document
• We can use Huffman code building algorithm from HW#4
  o So we will keep HuffmanTree and HuffmanNode
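The memory-savings step can be sketched as follows. This is only an illustration, assuming the encoded document is held as a String of '0' and '1' characters (one character per bit); the class and method names are hypothetical and not part of the assignment.

```java
class SavingsSketch {
    // Bits needed to store the text as 8-bit ASCII: 8 bits per character.
    static long asciiBits(String contents) {
        return 8L * contents.length();
    }

    // Bits needed by the Huffman encoding, assuming the encoded message is a
    // String of '0'/'1' characters where each character stands for one bit.
    static long huffmanBits(String encoded) {
        return encoded.length();
    }

    // Fraction of space saved relative to 8-bit ASCII (0.0 for empty input).
    static double savings(String contents, String encoded) {
        long ascii = asciiBits(contents);
        if (ascii == 0) {
            return 0.0;
        }
        return 1.0 - (double) huffmanBits(encoded) / ascii;
    }
}
```

With the HW#4 example code (c: 0, a: 10, b: 11), the text "cab" encodes to "01011", which is 5 bits instead of 24.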
Organization
• The new code for this assignment should go into HuffmanConverter.java
  o The name of the file to encode is passed as a parameter on the command line
  o So if my file is foo.txt, I should be able to run
    java HuffmanConverter foo.txt
  o Then foo.txt shows up in args[0]
  o If you use an IDE, specify command-line options through the menus
• Test inputs and outputs linked from assignment page (2007 version)
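A minimal sketch of picking up the file name from args[0]. The class name, the helper fileNameFrom(), and the usage message are hypothetical; the real entry point would be main() in HuffmanConverter.java.

```java
class CommandLineSketch {
    // Returns the file name given on the command line, or null if none was given.
    static String fileNameFrom(String[] args) {
        return args.length > 0 ? args[0] : null;
    }

    public static void main(String[] args) {
        String fileName = fileNameFrom(args);
        if (fileName == null) {
            // Assumed behavior: print a usage hint instead of failing on args[0].
            System.err.println("usage: java HuffmanConverter <file>");
            return;
        }
        System.out.println("Encoding " + fileName);
    }
}
```

Checking args.length first avoids an ArrayIndexOutOfBoundsException when the program is started without an argument, which is easy to do by accident from an IDE.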
HuffmanConverter Instance Vars
• String contents - stores file to process
  o Lines are separated by '\n' - line break character
  o e.g., twoLines = line1 + '\n' + line2;
• HuffmanTree huffmanTree - output of HW4
• int count[] - frequencies in input file
  o Indexed on ASCII value of characters, e.g., count[(int)'a'] is frequency of 'a'
• String code[] - binary string per character
  o Also indexed on ASCII value, e.g., code[(int)'a'] is "10001"
To Implement
• readContents() - reads in a file and stores in String contents
• recordFrequencies() - process file stored in contents and store frequencies in count[]
• frequenciesToTree() - use HW4 code to produce Huffman tree
• treeToCode() - slight modification of HW4: traverse Huffman tree and populate code[]
• encodeMessage() - use code[] to encode
• decodeMessage() - use inverse of code[]
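The first two methods might look roughly like this. It is a sketch under assumptions: the methods are written as static helpers that take and return values (the assignment stores them in the instance variables contents and count[]), java.util.Scanner does the reading, and count[] has 256 entries, which is a guess at a reasonable table size.

```java
import java.util.Scanner;

class ReadAndCountSketch {
    // Reads the input one line at a time and joins the lines with '\n'.
    static String readContents(Scanner in) {
        StringBuilder contents = new StringBuilder();
        boolean first = true;
        while (in.hasNextLine()) {
            if (!first) {
                contents.append('\n'); // separator between lines only
            }
            contents.append(in.nextLine());
            first = false;
        }
        return contents.toString();
    }

    // Counts how often each character occurs, indexed by character value.
    static int[] recordFrequencies(String contents) {
        int[] count = new int[256]; // assumed size; covers 8-bit characters
        for (int i = 0; i < contents.length(); i++) {
            char c = contents.charAt(i);
            if (c < count.length) { // skip anything outside the table
                count[c]++;
            }
        }
        return count;
    }
}
```

To read a real file, the Scanner could be constructed as new Scanner(new java.io.File(args[0])), which throws FileNotFoundException if the file is missing.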
Implementation Notes
• readContents() can use Scanner
  o Read a line at a time, and append to contents inserting '\n' to separate lines
• recordFrequencies(): iterate over contents one character at a time
• frequenciesToTree()
  o Very similar to main() method of HW4
  o Create a BinaryHeap object
  o For every non-zero-count letter, create a HuffmanNode object, insert into heap
  o Then run Huffman algorithm
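The frequenciesToTree() steps can be sketched as below. The course's BinaryHeap and HuffmanNode classes are not shown in these slides, so this sketch swaps in java.util.PriorityQueue and a hypothetical Node class; treat it as an outline of the algorithm rather than the HW4 code.

```java
import java.util.PriorityQueue;

class FrequenciesToTreeSketch {
    // Hypothetical stand-in for the course's HuffmanNode.
    static class Node implements Comparable<Node> {
        final char letter;      // meaningful only for leaves
        final int frequency;
        final Node left, right; // both null for a leaf

        Node(char letter, int frequency) {
            this(letter, frequency, null, null);
        }

        Node(char letter, int frequency, Node left, Node right) {
            this.letter = letter;
            this.frequency = frequency;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }

        @Override
        public int compareTo(Node other) {
            return Integer.compare(frequency, other.frequency);
        }
    }

    // Builds a Huffman tree from count[]; returns null if every count is zero.
    static Node frequenciesToTree(int[] count) {
        // PriorityQueue is a min-heap here, standing in for BinaryHeap.
        PriorityQueue<Node> heap = new PriorityQueue<>();
        for (int i = 0; i < count.length; i++) {
            if (count[i] > 0) {
                heap.add(new Node((char) i, count[i]));
            }
        }
        // Huffman's algorithm: repeatedly merge the two least frequent nodes.
        while (heap.size() > 1) {
            Node a = heap.poll();
            Node b = heap.poll();
            heap.add(new Node('\0', a.frequency + b.frequency, a, b));
        }
        return heap.poll();
    }
}
```

The root's frequency always equals the total number of counted characters, which makes a cheap sanity check on the result.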
Implementation Notes, Cont'd
• treeToCode()
  o Similar to printCode() of HW4
  o Instead of printing code, store in code[]
• encodeMessage()
  o For each character of contents, look up its binary string in code[], append
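A possible shape for these two methods, again with a hypothetical Node class in place of HuffmanNode. The convention that a left branch contributes '0' and a right branch '1' is an assumption; the HW4 printCode() may use the opposite one.

```java
class EncodeSketch {
    // Hypothetical stand-in for the course's HuffmanNode.
    static class Node {
        final char letter;      // meaningful only for leaves
        final Node left, right; // both null for a leaf

        Node(char letter) {
            this.letter = letter;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.letter = '\0';
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Walks the tree and stores each leaf's bit string in code[], indexed by
    // character value. Assumed convention: left edge = '0', right edge = '1'.
    static void treeToCode(Node node, String prefix, String[] code) {
        if (node == null) {
            return;
        }
        if (node.isLeaf()) {
            // A tree with a single leaf would give an empty code; use "0".
            code[node.letter] = prefix.isEmpty() ? "0" : prefix;
            return;
        }
        treeToCode(node.left, prefix + "0", code);
        treeToCode(node.right, prefix + "1", code);
    }

    // Concatenates the code of every character in contents.
    // Assumes every character in contents has an entry in code[].
    static String encodeMessage(String contents, String[] code) {
        StringBuilder encoded = new StringBuilder();
        for (int i = 0; i < contents.length(); i++) {
            encoded.append(code[contents.charAt(i)]);
        }
        return encoded.toString();
    }

    // Hand-built tree matching the HW#4 example: c: 0, a: 10, b: 11.
    static Node exampleTree() {
        return new Node(new Node('c'), new Node(new Node('a'), new Node('b')));
    }
}
```

Using StringBuilder rather than repeated String concatenation matters for longer documents, since each += on a String copies everything appended so far.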
Implementation Notes, Cont'd
• decodeMessage()
  o Need to implement inverse mapping of code[]: binary strings to characters
  o Several possible implementations
    - Traverse Huffman tree as you read binary string, output character when you reach a leaf
    - Build HashMap mapping strings to ASCII values of characters
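The tree-traversal option could look like the sketch below. Node is a hypothetical stand-in for HuffmanNode, '0' is assumed to mean "go left", and the tree is assumed to have at least two leaves (a single-leaf tree has no branches to follow).

```java
class TreeDecodeSketch {
    // Hypothetical stand-in for the course's HuffmanNode.
    static class Node {
        final char letter;      // meaningful only for leaves
        final Node left, right; // both null for a leaf

        Node(char letter) {
            this.letter = letter;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.letter = '\0';
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Follows one branch per bit; on reaching a leaf, emits its letter and
    // restarts from the root. Assumes '0' = left, '1' = right.
    static String decodeMessage(String bits, Node root) {
        StringBuilder decoded = new StringBuilder();
        Node current = root;
        for (int i = 0; i < bits.length(); i++) {
            current = (bits.charAt(i) == '0') ? current.left : current.right;
            if (current.isLeaf()) {
                decoded.append(current.letter);
                current = root;
            }
        }
        return decoded.toString();
    }

    // Hand-built tree matching the HW#4 example: c: 0, a: 10, b: 11.
    static Node exampleTree() {
        return new Node(new Node('c'), new Node(new Node('a'), new Node('b')));
    }
}
```

This works without any separator between codes because no Huffman code is a prefix of another: the first leaf reached is always the right one.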
HashMap
• An array maps integers to Objects
  o e.g., String args[]: args[i] returns ith String
• A HashMap maps Objects to Objects
• Access with put() and get(), e.g.,
  o HashMap ids = new HashMap();
  o ids.put("Alice", 123456789);
  o ids.put("Ben", 321654987);
  o int id = (Integer) ids.get("Alice");
  o // id gets 123456789
• For decode, map bit Strings to characters
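A sketch of the HashMap-based decode. It uses the generic form HashMap<String, Character> instead of the raw HashMap on the slide, and stores characters directly rather than their ASCII values; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class MapDecodeSketch {
    // Builds the inverse of code[]: bit string -> character.
    static Map<String, Character> invert(String[] code) {
        Map<String, Character> inverse = new HashMap<>();
        for (int i = 0; i < code.length; i++) {
            if (code[i] != null) { // characters absent from the input have no code
                inverse.put(code[i], (char) i);
            }
        }
        return inverse;
    }

    // Accumulates bits until they match a known code, emits that character,
    // then starts over with an empty buffer.
    static String decodeMessage(String bits, Map<String, Character> inverse) {
        StringBuilder decoded = new StringBuilder();
        StringBuilder pending = new StringBuilder();
        for (int i = 0; i < bits.length(); i++) {
            pending.append(bits.charAt(i));
            Character c = inverse.get(pending.toString());
            if (c != null) {
                decoded.append(c);
                pending.setLength(0);
            }
        }
        return decoded.toString();
    }
}
```

Matching on the shortest hit is safe only because Huffman codes are prefix-free; with an arbitrary code table this greedy lookup could split the bit string incorrectly.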
Homework #5 Tips
• Keep checking intermediate results
• Make use of the sample outputs
• Print out intermediate results!
• You might need special cases for newline ('\n')
• Your encoding might differ from the examples
  o Depends on the BinaryHeap implementation
  o Same-frequency items are returned in arbitrary order (e.g., in love_poem_58, 'N', '-', '.', 'W', and 'p' all have frequency one)
• However, Huffman encoding length must match!
  o Guaranteed to be shortest-length encoding
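One way to compare against the sample outputs when the individual codes differ is to compare total encoded length instead. The sketch below computes it from count[] and code[]; the class and method names are hypothetical.

```java
class EncodedLengthSketch {
    // Total bits in the encoded document: for each character that occurs,
    // its frequency times the length of its code.
    static long encodedLength(int[] count, String[] code) {
        long bits = 0;
        for (int i = 0; i < count.length; i++) {
            if (count[i] > 0) {
                bits += (long) count[i] * code[i].length();
            }
        }
        return bits;
    }
}
```

As a made-up illustration, take frequencies a: 1, b: 1, c: 2, d: 2. After merging a and b, three nodes of weight 2 remain, and the heap may return them in any order. One tie-break yields d: 0, c: 10, a: 110, b: 111; another yields a: 00, b: 01, c: 10, d: 11. The codes differ, but both give 12 bits in total.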