Text Compression Using LZW and Flate
By Subeer Rangra (08EBKCS059) & Mukul Ranjan (08EBKCS029)


1. By Subeer Rangra (08EBKCS059) & Mukul Ranjan (08EBKCS029)

2. Index
   1. Introduction to Data Compression
   2. Introduction to Text Compression
   3. LZW
      3.1 LZW Encoding Algorithm
      3.2 Encoding a String Example
      3.3 LZW Decoding Algorithm
      3.4 Decoding a String Example
   4. Flate Compression
      4.1 Decomposition
          4.1.1 Huffman Coding
          4.1.2 LZ77 Compression
          4.1.3 Putting Both Together
   5. Advantages and Disadvantages
      5.1 LZW
      5.2 Flate
   6. Conclusion

3. 1. Introduction to Data Compression
   - Encoding information using fewer bits than the original representation.
   - Data compression is achieved when redundancies are reduced or eliminated.
   - Lossless: no information is lost.
   - Lossy: some information is lost.
   - Compression reduces the data storage space.

4. Introduction to Data Compression, contd.
   - Reduces the transmission time needed over the network.
   - Data must be decompressed or decoded to be reused.
   - Symmetrical or asymmetrical.
   - Software or hardware.

5. 2. Introduction to Text Compression
   - The compression of text-based data.
   - There is a major difference between text and image compression: databases, binary programs and text on one side, and sound, image and video signals on the other.
   - Text compression needs lossless compression.
   - Needed in literary works, product catalogues, genomic databases and raw text databases.

6. 3. LZW (Lempel-Ziv-Welch)
   - Starts with a dictionary of all the single characters and gradually builds the dictionary as the information is sent through.
   - Lossless compression, hence it works well for text compression.
   - A dictionary (code table) based encoding algorithm.
   - Uses a code table, with 4096 as a common choice for the number of entries.
   - It tries to identify repeated sequences of data and adds them to the code table.

7. LZW (Lempel-Ziv-Welch), contd.
   - A general compression algorithm capable of working on almost any type of data.
   - Large text files in the English language can typically be compressed to half their size.
   - Used in GIF (Graphics Interchange Format) to reduce file size without degrading the visual quality.

8. 3.1 LZW Encoding Algorithm
   1.  STRING = get input character
   2.  WHILE not end of input stream DO
   3.    CHARACTER = get input character
   4.    IF STRING+CHARACTER is in the string table THEN
   5.      STRING = STRING+CHARACTER
   6.    ELSE
   7.      output the code for STRING
   8.      add STRING+CHARACTER to the string table
   9.      STRING = CHARACTER
   10.   END of IF
   11. END of WHILE
   12. output the code for STRING

9. LZW Encoding Flowchart

10. 3.2 Encoding a String Example
    To encode a string of characters:
    1. First generate an initial dictionary of single characters.

       Symbol | Binary | Decimal
       #      | 00000  | 0
       A      | 00001  | 1
       B      | 00010  | 2
       C      | 00011  | 3
       D      | 00100  | 4
       E      | 00101  | 5
       ... continued up to Z (code 26)

11. Encoding a String Example, contd.
    2. Example: TOBEORNOTTOBEORTOBEORNOT#

       Current Sequence | Next Char | Output Code | Output Bits | Extended Dictionary | Comments
       NULL             | T         |             |             |                     |
       T                | O         | 20          | 10100       | 27: TO              | 27 = first available code after 0 through 26
       O                | B         | 15          | 01111       | 28: OB              |
       B                | E         | 2           | 00010       | 29: BE              |
       E                | O         | 5           | 00101       | 30: EO              |
       O                | R         | 15          | 01111       | 31: OR              |
       R                | N         | 18          | 10010       | 32: RN              | 32 requires 6 bits, so for the next output use 6 bits
       N                | O         | 14          | 001110      | 33: NO              |
       O                | T         | 15          | 001111      | 34: OT              |
       T                | T         | 20          | 010100      | 35: TT              |
       TO               | B         | 27          | 011011      | 36: TOB             |
       BE               | O         | 29          | 011101      | 37: BEO             |

12. Encoding a String Example, contd.

       Current Sequence | Next Char | Output Code | Output Bits | Extended Dictionary | Comments
       OR               | T         | 31          | 011111      | 38: ORT             |
       TOB              | E         | 36          | 100100      | 39: TOBE            |
       EO               | R         | 30          | 011110      | 40: EOR             |
       RN               | O         | 32          | 100000      | 41: RNO             |
       OT               | #         | 34          | 100010      |                     | # stops the algorithm; send the current sequence
                        |           | 0           | 000000      |                     | and the stop code

13. 3.3 LZW Decoding Algorithm
    1.  Read OLD_CODE
    2.  output OLD_CODE
    3.  CHARACTER = OLD_CODE
    4.  WHILE there are still input characters DO
    5.    Read NEW_CODE
    6.    IF NEW_CODE is not in the translation table THEN
    7.      STRING = get translation of OLD_CODE
    8.      STRING = STRING+CHARACTER
    9.    ELSE
    10.     STRING = get translation of NEW_CODE
    11.   END of IF
    12.   output STRING
    13.   CHARACTER = first character in STRING
    14.   add OLD_CODE + CHARACTER to the translation table
    15.   OLD_CODE = NEW_CODE
    16. END of WHILE
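The encoding and decoding procedures above (slides 8 and 13) can be sketched in a few lines of Python. This is only an illustrative sketch of the pseudocode, not the exact scheme of the worked example: it assumes an initial dictionary of all 256 single-byte characters rather than the 27-symbol #, A-Z table, and it emits whole integer codes instead of packed 5- and 6-bit outputs with a stop code.

```python
def lzw_encode(data):
    """LZW encoding as in slide 8: grow the string table while emitting codes."""
    # Assumed initial dictionary: every single character, codes 0..255.
    table = {chr(i): i for i in range(256)}
    next_code = 256
    string = ""
    output = []
    for character in data:
        if string + character in table:
            string = string + character            # keep extending the match
        else:
            output.append(table[string])           # emit the code for STRING
            table[string + character] = next_code  # add STRING+CHARACTER
            next_code += 1
            string = character
    if string:
        output.append(table[string])               # emit the final STRING
    return output


def lzw_decode(codes):
    """LZW decoding as in slide 13: rebuild the same table from the codes alone."""
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    old_code = codes[0]
    string = table[old_code]
    output = [string]
    character = string[0]
    for new_code in codes[1:]:
        if new_code not in table:                  # code not defined yet: it must be
            string = table[old_code] + character   # OLD_CODE's string plus its first char
        else:
            string = table[new_code]
        output.append(string)
        character = string[0]
        table[next_code] = table[old_code] + character  # add OLD_CODE + CHARACTER
        next_code += 1
        old_code = new_code
    return "".join(output)


if __name__ == "__main__":
    text = "TOBEORNOTTOBEORTOBEORNOT"
    codes = lzw_encode(text)
    assert lzw_decode(codes) == text               # round trip recovers the input
    print(codes)
```

The round trip at the bottom is an easy way to check the two halves against each other: the decoder never sees the table built by the encoder, only the stream of codes.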
14. LZW Decoding Flowchart

15. 3.4 Decoding a String Example
    To decode an LZW-compressed archive, one needs to know in advance the initial dictionary used, but additional entries can be reconstructed, as they are always simply concatenations of previous entries.

       Input Bits | Code | Output | New Entry (Full) | New Entry (Conjecture) | Comments
       10100      | 20   | T      |                  | 27: T?                 |
       01111      | 15   | O      | 27: TO           | 28: O?                 |
       00010      | 2    | B      | 28: OB           | 29: B?                 |
       00101      | 5    | E      | 29: BE           | 30: E?                 |
       01111      | 15   | O      | 30: EO           | 31: O?                 |
       10010      | 18   | R      | 31: OR           | 32: R?                 | created code 31 (the last to fit in 5 bits), so start reading input at 6 bits
       001110     | 14   | N      | 32: RN           | 33: N?                 |

16. 4. Flate Compression
    - A lossless data compression method.
    - Can discover and exploit many patterns in the input data.
    - An improvement over LZW compression; Flate-encoded data is usually much more compact than LZW-encoded output.
    - It was originally defined by Phil Katz for version 2 of his PKZIP archiving tool and was later specified in RFC 1951.
    - Used in PDF compression; Adobe uses a Flate compression tool for PDF files.

17. 4.1 Decomposition
    - The Flate specification defines a lossless data format that compresses data using a combination of the LZ77 algorithm and Huffman coding.
    - Hence the format can be implemented readily in a manner not covered by patents.
    - The manner in which these two algorithms work is explained below, followed by the combination of the two that produces Flate compression.

18. 4.1.1 Huffman Coding
    - A type of entropy encoding algorithm.
    - Used for lossless data compression.
    - Can be used to generate variable-length codes.
    - The variable-length codes are generated based on the frequency of occurrence of the characters.
    - The idea is to assign the shortest code to the character with the highest probability of occurrence.

19. Huffman Coding, contd.
    - The algorithm starts by assigning each element a weight: a number that represents its relative frequency within the data to be compressed.
    - Taking an example for the set of weights {1, 2, 3, 3, 4}:
    1. They are assigned to be the nodes (leaves) of the Huffman tree to be formed.

20. Huffman Coding, contd.
    2. During the first step, the two nodes with the lowest weights (highest priority / lowest probability), 1 and 2, are merged to create a new tree with a root of weight 3.

21. Huffman Coding, contd.
    3. Now we have three nodes with weight 3 at their roots, so we pick two of them (the choice among equal weights is arbitrary) and merge them into a new tree with a root of weight 6.

22. Huffman Coding, contd.
    4. Now our two minimum trees are those of weights 3 and 4. We combine these to form a new tree of weight 7.

23. Huffman Coding, contd.
    5. Finally we merge the last two remaining trees, of weights 6 and 7, into a single tree of weight 13.

24. Huffman Coding, contd.
    - When all nodes have been recombined into a single "Huffman tree", then by starting at the root and selecting 0 or 1 at each step, you can reach any element in the tree.
    - Each element now has a Huffman code, which is the sequence of 0s and 1s that represents the path to it through the tree.
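This bottom-up construction can be sketched in Python with a priority queue. It is a minimal illustration rather than anything from the Flate specification: the symbol labels a-e are made up for readability, heapq is used for the "pick the two lowest weights" step, and ties between equal weights are broken arbitrarily, so the tree may be a mirror image of the one drawn in the slides even though the code lengths for the weights {1, 2, 3, 3, 4} come out the same.

```python
import heapq


def huffman_codes(weights):
    """Build a Huffman tree bottom-up and return {symbol: bit string}.

    weights: dict mapping symbol -> weight (relative frequency).
    Repeatedly merges the two lowest-weight trees, as in the slides'
    example, then reads codes off the tree (0 = left, 1 = right).
    """
    # Each heap entry is (weight, tie_breaker, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair (internal node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)                      # lowest weight
        w2, _, t2 = heapq.heappop(heap)                      # second lowest
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))   # merged tree
        counter += 1
    _, _, root = heap[0]

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):        # internal node: recurse both ways
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                              # leaf: record its code
            codes[node] = prefix or "0"

    walk(root, "")
    return codes


if __name__ == "__main__":
    # The weights {1, 2, 3, 3, 4} from the slides, under hypothetical labels.
    print(huffman_codes({"a": 1, "b": 2, "c": 3, "d": 3, "e": 4}))
```

With these inputs it prints a table like {'c': '00', 'd': '01', 'a': '100', 'b': '101', 'e': '11'}: the two lightest symbols receive 3-bit codes and the heavier ones 2-bit codes, which is exactly the "shortest code for the most frequent character" idea from slide 18.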
25. 4.1.2 LZ77 Compression
    - Works by finding sequences of data that are repeated.
    - A lossless data compression algorithm.
    - Maintains a sliding window during compression, which means the compressor has a record of what the last characters were.
    - Goes through the text in a sliding window consisting of a search buffer and a look-ahead buffer.
    - The search buffer is used as the dictionary.

26. LZ77 Compression, contd.
    1. Suppose the input text is AABABBBABAABABBBABBABB.
    2. The first block found is simply A, encoded as (0,A). The next is AB, encoded as (1,B), where 1 is a reference to A:
       A|AB|ABBBABAABABBBABBABB
    3. The next block is ABB, which is encoded as (2,B), where 2 is a reference to AB, entered in the dictionary one iteration ago. Going this way, the string parses into
       A|AB|ABB|B|ABA|ABAB|BB|ABBA|BB

27. LZ77 Compression, contd.
    At the end of the algorithm, the dictionary is:

       Reference | Phrase | Encoding
       1         | A      | (0,A)
       2         | AB     | (1,B)
       3         | ABB    | (2,B)
       4         | B      | (0,B)
       5         | ABA    | (2,A)
       6         | ABAB   | (5,B)
       7         | BB     | (4,B)
       8         | ABBA   | (3,A)
       9         | BB     | (7,0)

28. 4.1.3 Putting Both Together
    Flate is a smart algorithm that adapts the way it compresses data to the actual data themselves. There are three modes of compression available to the compressor:
    1. Not compressed at all: an intelligent choice when the data has already been compressed.
    2. Compression, first with LZ77 and then with a slightly modified version of Huffman coding. The trees that are used are defined by the Flate specification itself.

29. Putting Both Together, contd.
    3. Compression, first with LZ77 and then with Huffman coding, using trees that the compressor creates and stores along with the data.
    The data is broken up into blocks; each block uses a single mode of compression.

30. 5. Advantages & Disadvantages
    5.1 LZW
    Advantages
    - It is a lossless compression algorithm, hence no information is lost.
    - The code table need not be passed between the compressor and the decompressor.
    - Simple, fast and good compression.
    Disadvantages
    - What happens when the dictionary becomes too large? One approach is to throw the dictionary away when it reaches a certain size.
    - Useful only for large amounts of text data where redundancy is high.

31. Advantages & Disadvantages
    5.2 Flate Compression
    Advantages
    - Huffman coding is easy to implement.
    - Flate is a lossless compression technique, hence there is no loss of text.
    - Simple, fast and good compression.
    - Freedom to choose the type of compression based on the needs of the content.
    Disadvantages
    - Overhead is generated by Huffman tree construction.
    - The resulting compression code becomes complex as it combines LZ77 and Huffman.
    - It is quite tricky to understand and correctly apply the right combination of LZ77 and Huffman.

32. 6. Conclusion
    - LZW has various advantages when used to compress large text data in the English language, which has high redundancy.
    - Both LZW and Flate are software-based, dictionary-based and lossless methods of compression.
    - Text compression needs a lossless compression technique.
    - Flate, which is readily used in PDF files, is an adaptive, flexible and more complex way to compress text.

33. Thank You