
Page 1: Hybrid prefix codes for practical use - Brandeisdilant/cs175/Talks_1/[P... · 2006. 11. 11. · Hybrid Codes-Palak Mehta.ppt Author: Antonella Created Date: 11/11/2006 4:35:26 PM

Hybrid prefix codes for practical use

(DCC 2003)

- Palak Mehta, 11/13/2006

Mike Liddell and Alistair Moffat


The Agenda:

- Why hybrid prefix codes (K-flat codes)?
- Definition
- Algorithm: calculating K-flat codes
- Reducing the space requirements
- Reducing the time requirements
- Combining time and space improvements
- Redundancy of a K-flat code
- Experimental results
- Conclusion


Why hybrid prefix codes:

Flat (fixed-length) codes have a simple structure that facilitates fast decoding, but compression is often poor.

Minimum-redundancy prefix codes, by contrast, are slower to decode, but the compressed file size is minimized.

The hybrid code enables fast decoding and also provides fast compressed string searching.


Definition:

A K-flat code comprises $K = 2^k$ flat sub-trees, of depths $d_1, d_2, \ldots, d_K$, where each sub-tree is rooted at depth $k$. A K-flat code over $n$ symbols is represented by an arrangement that lists the number of elements per sub-tree, $A(K,n) = \{c_1, c_2, \ldots, c_K\}$, with $\sum_{1 \le i \le K} c_i = n$.
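
To make the definition concrete, the sketch below turns an arrangement into codeword lengths. It assumes that a sub-tree holding $c_i$ symbols has depth $d_i = \lceil \log_2 c_i \rceil$ (this matches the cost terms used later in the algorithm, but is an assumption here), so every symbol in sub-tree $i$ gets a codeword of $k + d_i$ bits. The function names are hypothetical.

```python
from fractions import Fraction


def ceil_log2(c: int) -> int:
    """Return ceil(log2(c)) for an integer c >= 1."""
    return (c - 1).bit_length()


def codeword_lengths(arrangement: list[int]) -> list[int]:
    """Codeword length of every symbol under a K-flat arrangement.

    Assumption: sub-tree i (holding c_i symbols) has depth ceil(log2 c_i),
    and all sub-trees are rooted at depth k = log2 K.
    """
    K = len(arrangement)
    k = K.bit_length() - 1
    if K < 1 or (1 << k) != K:
        raise ValueError("K must be a power of two")
    if any(c < 1 for c in arrangement):
        raise ValueError("every sub-tree must hold at least one symbol")
    lengths = []
    for c in arrangement:
        lengths.extend([k + ceil_log2(c)] * c)
    return lengths


# A 4-flat code over n = 8 symbols with arrangement {1, 1, 2, 4}.
lengths = codeword_lengths([1, 1, 2, 4])
# Kraft sum: equals 1 when every sub-tree is fully flat (power-of-two size).
kraft = sum(Fraction(1, 2 ** l) for l in lengths)
```

In this example the two single-symbol sub-trees give 2-bit codewords, the pair gives 3-bit codewords, and the last four symbols get 4-bit codewords.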

Example: (figure not included in this transcript)

Dynamic programming solution:

To construct a dynamic programming algorithm for the minimum-redundancy K-flat code, we first introduce the following concepts:

Canonical K-flat code: a code for which the first $K-1$ sub-trees are fully flat and the last sub-tree is partially flat.

Minimum-redundancy K-flat arrangement: $A(K,n)$ can be decomposed into two components: a minimum-redundancy arrangement containing $K-1$ trees and $m$ symbols, $A(K-1,m)$ where $m = n - c_K$, and an arrangement containing one tree and $c_K$ symbols, $A(1,c_K)$.
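
The decomposition can be read as a recurrence over a table of costs. The following is a sketch, assuming $F_x = \sum_{1 \le i \le x} f_i$ is the cumulative frequency of the first $x$ symbols, $C(r,c)$ is the cost of the best arrangement of the first $c$ symbols into $r$ sub-trees, and a sub-tree of $s$ symbols costs $\lceil \log s \rceil$ bits per symbol. The canonical-form test is left out for brevity.

```latex
\begin{align*}
C(1,c) &= F_c \,\lceil \log c \rceil, \\
C(r,c) &= \min_{1 \le x < c} \Bigl\{ C(r-1,x) + (F_c - F_x)\,\lceil \log (c-x) \rceil \Bigr\},
          \qquad 2 \le r \le K .
\end{align*}
```

Here $x$ plays the role of $m$ in the decomposition and $c - x$ plays the role of the last sub-tree's size.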


Algorithm:

Input: sorted frequencies $f_1, f_2, \ldots, f_n$ and $K$.
Set $F_x = \sum_{1 \le i \le x} f_i$, for $1 \le x \le n$.
Set $L(1,c) = c$ and $C(1,c) = F_c \lceil \log c \rceil$, for $1 \le c \le n$.
Set all other $C(r,c)$ values to 'undefined'.
/* generate the table items */
For $r = 2 \ldots K$ do
    For $c = 1 \ldots n$ do
        For $x = 1 \ldots c-1$ do
            if $c - x < L(r-1,x)/2 + 1$ or $L(r-1,x)$ is not a power of 2 then
                skip this extension (as it is not canonical).
            set cost $= C(r-1,x) + (F_c - F_x) \lceil \log(c-x) \rceil$.
            if cost $< C(r,c)$ then
                set $L(r,c) = c - x$ and $C(r,c) =$ cost.
/* back-trace to determine A(K,n) */
Set $r = K$ and $c = n$.
While $r \ge 1$ do
    set $c_r = L(r,c)$.
    set $c = c - c_r$.
    set $r = r - 1$.
Output: $A(K,n) = \{c_1, c_2, \ldots, c_K\}$.
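
The steps on this slide can be sketched in Python as follows. This is an illustrative transcription rather than the authors' implementation, and it makes a few assumptions: frequencies are sorted in non-increasing order, logarithms are base 2 and rounded up, 'undefined' costs are modelled as infinity, and the back-trace subtracts the sub-tree size just recorded before moving to the previous row. The function name `k_flat_arrangement` is hypothetical.

```python
import math


def ceil_log2(c: int) -> int:
    """Return ceil(log2(c)) for an integer c >= 1."""
    return (c - 1).bit_length()


def is_pow2(v: int) -> bool:
    return v > 0 and v & (v - 1) == 0


def k_flat_arrangement(freqs: list[int], K: int) -> tuple[list[int], int]:
    """Return (arrangement, cost) found by the O(K n^2) table-filling scheme.

    The cost counts only the bits spent inside the sub-trees; the k prefix
    bits are the same for every symbol and are not included.
    """
    n = len(freqs)
    F = [0] * (n + 1)                      # F[x] = f_1 + ... + f_x
    for i, f in enumerate(freqs, start=1):
        F[i] = F[i - 1] + f

    INF = math.inf                         # stands in for 'undefined'
    C = [[INF] * (n + 1) for _ in range(K + 1)]
    L = [[0] * (n + 1) for _ in range(K + 1)]
    for c in range(1, n + 1):
        L[1][c] = c
        C[1][c] = F[c] * ceil_log2(c)

    # Generate the table items.
    for r in range(2, K + 1):
        for c in range(1, n + 1):
            for x in range(1, c):
                if C[r - 1][x] == INF:
                    continue               # nothing to extend
                prev = L[r - 1][x]
                if c - x < prev / 2 + 1 or not is_pow2(prev):
                    continue               # extension is not canonical
                cost = C[r - 1][x] + (F[c] - F[x]) * ceil_log2(c - x)
                if cost < C[r][c]:
                    L[r][c] = c - x
                    C[r][c] = cost

    if C[K][n] == INF:
        raise ValueError("no canonical K-flat arrangement for these inputs")

    # Back-trace to recover the sub-tree sizes c_1 .. c_K.
    sizes = [0] * K
    r, c = K, n
    while r >= 1:
        sizes[r - 1] = L[r][c]
        c -= L[r][c]
        r -= 1
    return sizes, C[K][n]


arrangement, cost = k_flat_arrangement([8, 4, 2, 1, 1], K=2)
```

For the five frequencies above and K = 2, the most frequent symbol ends up alone in the first sub-tree and the other four share a 2-bit sub-tree.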

Example: (figure not included in this transcript)

Time and space requirements:

The algorithm calculates the minimum-redundancy K-flat code for $n$ symbols and $K$ sub-trees in $O(Kn^2)$ time and $O(Kn)$ space.

How to reduce space requirements?

How to reduce time requirements?


Reducing the space requirements:

To reduce space, note that only those items for which $L(r,c)$ is a power of two are visited during the back-trace phase, except for the item stored in $L(K,n)$. Also, if $L(r,c) = 2^a$ and $L(r,c') = 2^b$ with $c' > c$, then $b \ge a$.

Thus the original table may be simplified to a transition table which records, for each row, the first column where each power of two appears.

So the required changes to the algorithm are to calculate and store only the transition information for each row, and to alter the back-trace phase to use the new format.
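
A minimal sketch of the transition-table idea, assuming one row of $L(r,c)$ values is available as a Python list indexed by column (index 0 unused). The row used in the example is invented for illustration, and the helper names are hypothetical. Because power-of-two entries never decrease from left to right, the value at a visited column is the largest power of two whose first column is at or before it.

```python
from bisect import bisect_right


def is_pow2(v: int) -> bool:
    return v > 0 and v & (v - 1) == 0


def row_transitions(row: list[int]) -> list[tuple[int, int]]:
    """Compress one table row into (first_column, power_of_two) pairs.

    row[c] holds L(r, c) for c = 1..n; entries that are not powers of two
    (including 0 for 'undefined') are ignored, since the back-trace never
    visits them.
    """
    seen = set()
    transitions = []
    for c in range(1, len(row)):
        v = row[c]
        if is_pow2(v) and v not in seen:
            seen.add(v)
            transitions.append((c, v))
    return transitions


def lookup(transitions: list[tuple[int, int]], c: int):
    """Recover L(r, c) for a column whose entry is a power of two."""
    columns = [col for col, _ in transitions]
    i = bisect_right(columns, c) - 1
    return transitions[i][1] if i >= 0 else None


# Invented row for columns 1..10 (index 0 unused, 0 means 'undefined').
row = [0, 0, 1, 2, 2, 3, 4, 4, 6, 8, 8]
table = row_transitions(row)
```

Each row shrinks from n entries to at most one entry per power of two, which is where the logarithmic row size comes from.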


Reducing the space requirements (contd.):

In an entry $L(r,c) = 2^a$, the exponent $a$ is at most $\lceil \log n \rceil$, so each row of the transition table has $O(\log n)$ entries and the whole table requires $O(K \log n)$ space.

To generate the transition table, the original algorithm is run as usual, but only two rows of the full table are maintained.

After the creation of a full row, the corresponding row of the transition table is created and the previous full row is then discarded.

Thus, the total space required is $O(n + K \log n)$.


Reducing the time requirements:

To reduce time, a new table is formed, called an approximate table, that allows only fully flat sub-trees except for $L(K,n)$.

The basic change required is to alter the innermost loop of the algorithm to generate only those extensions for which $c - x$ is a power of two.

So the extensions are chosen from the powers of two $\{1, 2, 4, \ldots\}$ that do not exceed $n$, of which there are $O(\log n)$. The inner loop therefore executes $O(\log n)$ times and the algorithm as a whole requires $O(Kn \log n)$ time.
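
The restricted inner loop can be sketched as below. This is an assumption-laden illustration, not the authors' code: it fills one row of an approximate table from the previous row's costs, tries only power-of-two extension sizes, and omits both the canonical-form test and the special treatment of the final item $L(K,n)$. The names `approx_row` and `powers_of_two_below` are hypothetical.

```python
import math


def powers_of_two_below(c: int):
    """Yield 1, 2, 4, ... strictly less than c (so that x = c - p >= 1)."""
    p = 1
    while p < c:
        yield p
        p *= 2


def approx_row(prev_cost: list[float], F: list[int], n: int):
    """Fill one row using only extensions whose size c - x is a power of two.

    prev_cost[x] is C(r-1, x) (math.inf where undefined) and F[x] is the
    cumulative frequency of the first x symbols. Returns (cost, size), where
    size[c] is the chosen sub-tree size L(r, c), or 0 if undefined.
    """
    cost = [math.inf] * (n + 1)
    size = [0] * (n + 1)
    for c in range(1, n + 1):
        for p in powers_of_two_below(c):       # O(log n) candidates
            x = c - p
            if prev_cost[x] == math.inf:
                continue
            depth = p.bit_length() - 1         # log2(p), exact for powers of two
            cand = prev_cost[x] + (F[c] - F[x]) * depth
            if cand < cost[c]:
                cost[c] = cand
                size[c] = p
    return cost, size


# Frequencies 8, 4, 2, 1, 1; row 1 is defined only for power-of-two sizes.
F = [0, 8, 12, 14, 15, 16]
row1 = [math.inf, 0, 12, math.inf, 30, math.inf]
cost2, size2 = approx_row(row1, F, n=5)
```

Each column now examines a logarithmic number of candidates instead of up to n - 1 of them.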


Reducing the time requirements (contd.):

For further savings, note that for any valid item $I(r,c)$ in an approximate table, all valid items to its left, $I(r,c-a)$, have $L(r,c-a) \le 2L(r,c)$, and all valid items to its right, $I(r,c+a)$, have $L(r,c+a) \ge L(r,c)/2$.

A partitioning strategy is used to first calculate item $I(r,\lceil n/2 \rceil)$. The value of $L(r,\lceil n/2 \rceil)$ can then be used to limit the range of the other $L(r,c)$ values on row $r$. The items $I(r,c)$ for $1 \le c < \lceil n/2 \rceil$ are then calculated recursively, as are the items $I(r,c)$ for $\lceil n/2 \rceil < c \le n$.

So, if the original algorithm is adjusted to create an approximate table, and creates each row of that table by using a partitioning approach, the running time of the amended algorithm is $O(Kn)$.

Example: (figure not included in this transcript)

Combining time and space improvements:

Combining the time and space improvements requires the creation of a transition table for an approximate table.

Also, if in an approximate table $L(r,c+a) < L(r,c)$ for some $a > 0$, then $L(r,c)$ does not represent a minimum-redundancy solution for $A(r,c)$.

To implement this new rule, each row of the transition table is created by locating the last entry of $L(r,c)$ for each power of two from 1 up to $2^{\lceil \log n \rceil}$. This information allows any required $L(r,c)$ value to be determined.


Redundancy of a K-flat code:

If $K$ is small, a K-flat code may not be flexible enough to capture the properties of the input.

If $K$ is large, the code is flexible, but no codeword can be shorter than $k = \log K$ bits, which may be unwieldy.

When the Huffman code for a large input has a shortest codeword of length $l_H$, with $l_H \ge 4$, the best K-flat code typically has $k \approx l_H$, and the redundancy of this code is only marginally higher than that of the unrestricted code.

If $l_H < 4$, there is a risk that a K-flat code cannot code the alphabet without significant redundancy.
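
The quantity $l_H$ can be measured directly from the symbol frequencies. The sketch below is a textbook Huffman construction using Python's `heapq`, not code from the paper; the function name `huffman_lengths` is hypothetical, and tie-breaking by insertion order is an arbitrary choice that may change individual lengths but not the total cost.

```python
import heapq


def huffman_lengths(freqs: list[int]) -> list[int]:
    """Return a Huffman codeword length for each symbol."""
    lengths = [0] * len(freqs)
    # Heap items: (weight, unique id for tie-breaking, symbols in the subtree).
    heap = [(f, i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    next_id = len(freqs)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1            # merged symbols move one level deeper
        heapq.heappush(heap, (w1 + w2, next_id, s1 + s2))
        next_id += 1
    return lengths


# Skewed input: the shortest codeword is 1 bit, so l_H < 4 and a K-flat
# code is at risk of significant redundancy.
l_h_skewed = min(huffman_lengths([8, 4, 2, 1, 1]))

# Uniform input over 16 symbols: every codeword has 4 bits, so l_H = 4.
l_h_uniform = min(huffman_lengths([1] * 16))
```

Comparing `l_h_skewed` with `l_h_uniform` shows the two regimes the slide distinguishes.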

Experimental results: (figures not included in this transcript)

Conclusion:

K-flat codes offer fast decoding and fast compressed string searching.

The redundancy of K-flat codes on inputs such as trecword is only marginally in excess of that achieved by minimum-redundancy unrestricted codes.