annotate zlib/algorithm.txt @ 51:ae3a4bfb450b

add some files of version 4.4.3 that have been forgotten.
author kent <kent@cr.ie.u-ryukyu.ac.jp>
date Sun, 07 Feb 2010 18:27:48 +0900

1. Compression algorithm (deflate)

The deflation algorithm used by gzip (also zip and zlib) is a variation of
LZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
the input data. The second occurrence of a string is replaced by a
pointer to the previous string, in the form of a pair (distance,
length). Distances are limited to 32K bytes, and lengths are limited
to 258 bytes. When a string does not occur anywhere in the previous
32K bytes, it is emitted as a sequence of literal bytes. (In this
description, `string' must be taken as an arbitrary sequence of bytes,
and is not restricted to printable characters.)

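To make the (distance, length) idea concrete, here is a minimal sketch in
Python. It is a brute-force illustration of the substitution described above,
not the code zlib uses: the names lz77_tokens and lz77_expand are made up for
this example, and the minimum match length of 3 is an assumption borrowed from
the hash table description later in this section.

```python
WINDOW = 32 * 1024   # maximum match distance
MAX_LEN = 258        # maximum match length
MIN_LEN = 3          # shortest match worth emitting (assumption)

def lz77_tokens(data: bytes):
    """Greedy, brute-force LZ77: yield literal ints or (distance, length) pairs."""
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - WINDOW), i):
            n = 0
            # A match may run past position i (overlap), as LZ77 allows.
            while n < MAX_LEN and i + n < len(data) and data[j + n] == data[i + n]:
                n += 1
            if n >= MIN_LEN and n >= best_len:   # ties go to the nearer match
                best_len, best_dist = n, i - j
        if best_len:
            yield (best_dist, best_len)
            i += best_len
        else:
            yield data[i]
            i += 1

def lz77_expand(tokens) -> bytes:
    """Rebuild the original bytes from literals and (distance, length) pairs."""
    out = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            dist, length = t
            for _ in range(length):      # byte-by-byte copy handles overlap
                out.append(out[-dist])
        else:
            out.append(t)
    return bytes(out)

tokens = list(lz77_tokens(b"abcabcabcabc"))
print(tokens)                # [97, 98, 99, (3, 9)]
print(lz77_expand(tokens))   # b'abcabcabcabc'
```

Note that the pair (3, 9) copies more bytes than its distance, so the copy
overlaps the bytes it is producing. That is why the expander copies one byte
at a time.
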
Literals or match lengths are compressed with one Huffman tree, and
match distances are compressed with another tree. The trees are stored
in a compact form at the start of each block. The blocks can have any
size (except that the compressed data for one block must fit in
available memory). A block is terminated when deflate() determines that
it would be useful to start another block with fresh trees. (This is
somewhat similar to the behavior of LZW-based _compress_.)

Duplicated strings are found using a hash table. All input strings of
length 3 are inserted in the hash table. A hash index is computed for
the next 3 bytes. If the hash chain for this index is not empty, all
strings in the chain are compared with the current input string, and
the longest match is selected.

The hash chains are searched starting with the most recent strings, to
favor small distances and thus take advantage of the Huffman encoding.
The hash chains are singly linked. There are no deletions from the
hash chains; the algorithm simply discards matches that are too old.

To avoid a worst-case situation, very long hash chains are arbitrarily
truncated at a certain length, determined by a runtime option (level
parameter of deflateInit). So deflate() does not always find the longest
possible match but generally finds a match which is long enough.

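The hash chain search and its truncation can be sketched as follows. This is
an illustration under stated assumptions rather than zlib's implementation: a
Python dict keyed on the actual 3 bytes stands in for the hash function, and
the names HashChains, max_chain and search_at are invented for the example.

```python
WINDOW = 32 * 1024
MIN_LEN, MAX_LEN = 3, 258

class HashChains:
    def __init__(self, data: bytes, max_chain: int):
        self.data = data
        self.max_chain = max_chain      # chain entries examined per search
        self.head = {}                  # 3-byte key -> most recent position
        self.prev = [-1] * len(data)    # position -> older position, same key

    def insert(self, pos: int) -> None:
        key = self.data[pos:pos + MIN_LEN]
        if len(key) == MIN_LEN:
            self.prev[pos] = self.head.get(key, -1)   # singly linked, newest first
            self.head[key] = pos

    def longest_match(self, pos: int):
        """Return (distance, length) of the best match found, or (0, 0)."""
        data = self.data
        best_len, best_dist = 0, 0
        cand = self.head.get(data[pos:pos + MIN_LEN], -1)
        steps = 0
        while cand >= 0 and steps < self.max_chain:
            if pos - cand > WINDOW:     # too old: every later entry is older still
                break
            n = 0
            while n < MAX_LEN and pos + n < len(data) and data[cand + n] == data[pos + n]:
                n += 1
            if n > best_len:            # strict '>' keeps the nearer match on ties
                best_len, best_dist = n, pos - cand
            cand = self.prev[cand]
            steps += 1
        return (best_dist, best_len) if best_len >= MIN_LEN else (0, 0)

def search_at(data: bytes, pos: int, max_chain: int):
    chains = HashChains(data, max_chain)
    for p in range(pos):                # index everything before the search position
        chains.insert(p)
    return chains.longest_match(pos)

data = b"abcdXabcYabcd"
print(search_at(data, 9, max_chain=1))   # (4, 3): only the newest candidate is tried
print(search_at(data, 9, max_chain=8))   # (9, 4): the older, longer match is found
```

With a chain limit of 1 the search settles for the recent 3-byte match; with a
longer limit it reaches the older 4-byte one. This is the speed versus ratio
trade that the level parameter controls.
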
deflate() also defers the selection of matches with a lazy evaluation
mechanism. After a match of length N has been found, deflate() searches for
a longer match at the next input byte. If a longer match is found, the
previous match is truncated to a length of one (thus producing a single
literal byte) and the process of lazy evaluation begins again. Otherwise,
the original match is kept, and the next match search is attempted only N
steps later.

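The lazy evaluation step can be sketched like this. The brute-force
longest_match below is only a stand-in for the hash chain search, and both
function names are hypothetical; the point is the control flow that defers
the choice of a match by one byte.

```python
MIN_LEN, MAX_LEN, WINDOW = 3, 258, 32 * 1024

def longest_match(data: bytes, pos: int):
    """Brute-force stand-in for the chain search: (distance, length) or (0, 0)."""
    best = (0, 0)
    for j in range(max(0, pos - WINDOW), pos):
        n = 0
        while n < MAX_LEN and pos + n < len(data) and data[j + n] == data[pos + n]:
            n += 1
        if n >= MIN_LEN and n >= best[1]:
            best = (pos - j, n)
    return best

def lazy_tokens(data: bytes):
    """Yield literals (ints) and (distance, length) pairs using lazy evaluation."""
    i = 0
    while i < len(data):
        dist, length = longest_match(data, i)
        if length == 0:
            yield data[i]
            i += 1
            continue
        # Defer the decision: look for a longer match at the next byte.
        while i + 1 < len(data):
            ndist, nlength = longest_match(data, i + 1)
            if nlength <= length:
                break
            yield data[i]            # previous match truncated to one literal
            i += 1
            dist, length = ndist, nlength
        yield (dist, length)
        i += length                  # next search happens `length` bytes later

print(list(lazy_tokens(b"abc-bcde-abcde")))
# [97, 98, 99, 45, 98, 99, 100, 101, 45, 97, (6, 4)]
```

At the final "abcde", a greedy matcher would take the 3-byte match "abc" and
then emit "d" and "e" as literals. The lazy version emits "a" as a literal and
takes the 4-byte match "bcde" instead.
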
The lazy match evaluation is also subject to a runtime parameter. If
the current match is long enough, deflate() reduces the search for a longer
match, thus speeding up the whole process. If compression ratio is more
important than speed, deflate() attempts a complete second search even if
the first match is already long enough.

The lazy match evaluation is not performed for the fastest compression
modes (level parameter 1 to 3). For these fast modes, new strings
are inserted in the hash table only when no match was found, or
when the match is not too long. This degrades the compression ratio
but saves time since there are both fewer insertions and fewer searches.

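One way to observe the effect of the level parameter is through Python's
standard zlib module, which wraps the zlib library and accepts a level
argument. The sketch below assumes that this argument corresponds to the
level parameter discussed above; the sample data is arbitrary, and the sizes
printed will depend on the input and on the zlib version in use.

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog. " * 200

sizes = {}
for level in (1, 6, 9):
    packed = zlib.compress(data, level)
    assert zlib.decompress(packed) == data   # every level is lossless
    sizes[level] = len(packed)
    print("level", level, "->", len(packed), "bytes")
```
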

2. Decompression algorithm (inflate)

2.1 Introduction

The key question is how to represent a Huffman code (or any prefix code) so
that you can decode fast. The most important characteristic is that shorter
codes are much more common than longer codes, so pay attention to decoding the
short codes fast, and let the long codes take longer to decode.

inflate() sets up a first level table that covers some number of bits of
input less than the length of the longest code. It gets that many bits from
the stream and looks them up in the table. The table tells whether the next
code is that many bits or fewer, and if so, how many bits it is and what its
value is. Otherwise, the entry points to the next level table, for which
inflate() grabs more bits and tries to decode a longer code.

How many bits to make the first lookup is a tradeoff between the time it
takes to decode and the time it takes to build the table. If building the
table took no time (and if you had infinite memory), then there would only
be a first level table to cover all the way to the longest code. However,
building the table ends up taking a lot longer for more bits since short
codes are replicated many times in such a table. What inflate() does is
simply to make the number of bits in the first table a variable, and then
to set that variable for the maximum speed.

For inflate, which has 286 possible codes for the literal/length tree, the size
of the first table is nine bits. Also the distance trees have 30 possible
values, and the size of the first table is six bits. Note that for each of
those cases, the table ended up one bit longer than the ``average'' code
length, i.e. the code length of an approximately flat code which would be a
little more than eight bits for 286 symbols and a little less than five bits
for 30 symbols.

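The "average" code lengths quoted above are just base-2 logarithms of the
symbol counts, which a couple of lines of Python can confirm. The table
widths (nine and six bits) are taken from the text, not computed.

```python
import math

for symbols, root_bits in ((286, 9), (30, 6)):
    flat = math.log2(symbols)   # code length of an ideal flat code
    print(f"{symbols} symbols: flat code ~{flat:.2f} bits, first table {root_bits} bits")

# 286 symbols: flat code ~8.16 bits, first table 9 bits
# 30 symbols: flat code ~4.91 bits, first table 6 bits
```
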

2.2 More details on the inflate table lookup

Ok, you want to know what this cleverly obfuscated inflate tree actually
looks like. You are correct that it's not a Huffman tree. It is simply a
lookup table for the first, let's say, nine bits of a Huffman symbol. The
symbol could be as short as one bit or as long as 15 bits. If a particular
symbol is shorter than nine bits, then that symbol's translation is duplicated
in all those entries that start with that symbol's bits. For example, if the
symbol is four bits, then it's duplicated 32 times in a nine-bit table. If a
symbol is nine bits long, it appears in the table once.

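The duplication count follows directly from the number of unused index bits:
a code of length L fills 2^(9-L) entries of a nine-bit table. The helper below
is a hypothetical illustration that lists those entries, writing indices as
bit strings with the code's bits first, as in the description above. An actual
implementation may order the index bits differently.

```python
def replicated_entries(code: str, root_bits: int):
    """All root-table indices (as bit strings) that start with `code`."""
    pad = root_bits - len(code)
    if pad == 0:
        return [code]
    return [code + format(i, f"0{pad}b") for i in range(1 << pad)]

four_bit = replicated_entries("1010", 9)
print(len(four_bit), four_bit[0], four_bit[-1])   # 32 101000000 101011111
print(len(replicated_entries("101010101", 9)))     # 1
```
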
If the symbol is longer than nine bits, then that entry in the table points
to another similar table for the remaining bits. Again, there are duplicated
entries as needed. The idea is that most of the time the symbol will be short
and there will only be one table lookup. (That's the whole idea behind data
compression in the first place.) For the less frequent long symbols, there
will be two lookups. If you had a compression method with really long
symbols, you could have as many levels of lookups as is efficient. For
inflate, two is enough.

So a table entry either points to another table (in which case nine bits in
the above example are gobbled), or it contains the translation for the symbol
and the number of bits to gobble. Then you start again with the next
ungobbled bit.

You may wonder: why not just have one lookup table for however many bits the
longest symbol is? The reason is that if you do that, you end up spending
more time filling in duplicate symbol entries than you do actually decoding.
At least that is so for deflate's output, which generates new trees every
few tens of kbytes. You can imagine that filling in a 2^15 entry table for a
15-bit code would take too long if you're only decoding several thousand
symbols. At the other extreme, you could make a new table for every bit in
the code. In fact, that's essentially a Huffman tree. But then you spend too
much time traversing the tree while decoding, even for short symbols.

So the number of bits for the first lookup table is a trade-off between the
time to fill out the table and the time spent looking at the second level and
above of the table.

Here is an example, scaled down:

The code being decoded, with 10 symbols, from 1 to 6 bits long:

    A: 0
    B: 10
    C: 1100
    D: 11010
    E: 11011
    F: 11100
    G: 11101
    H: 11110
    I: 111110
    J: 111111

Let's make the first table three bits long (eight entries):

    000: A,1
    001: A,1
    010: A,1
    011: A,1
    100: B,2
    101: B,2
    110: -> table X (gobble 3 bits)
    111: -> table Y (gobble 3 bits)

Each entry is what the bits decode as and how many bits that is, i.e. how
many bits to gobble. Or the entry points to another table, with the number of
bits to gobble implicit in the size of the table.

Table X is two bits long since the longest code starting with 110 is five bits
long:

    00: C,1
    01: C,1
    10: D,2
    11: E,2

Table Y is three bits long since the longest code starting with 111 is six
bits long:

    000: F,2
    001: F,2
    010: G,2
    011: G,2
    100: H,2
    101: H,2
    110: I,3
    111: J,3

ae3a4bfb450b add some files of version 4.4.3 that have been forgotten.
kent <kent@cr.ie.u-ryukyu.ac.jp>
parents:
diff changeset
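As a rough illustration, the second-level tables can be rebuilt mechanically
from the code. The sketch below is not how inftrees.c does it; it assumes the
ten-symbol example code from earlier in this document (A: 0, B: 10, C: 1100,
and so on), keeps the bits as text strings in the order shown here, and uses
a hypothetical helper named build_subtable.

```python
# Example code assumed from earlier in this document (symbol -> bit string).
CODES = {"A": "0", "B": "10", "C": "1100", "D": "11010", "E": "11011",
         "F": "11100", "G": "11101", "H": "11110", "I": "111110",
         "J": "111111"}


def build_subtable(codes, prefix):
    """Hypothetical helper: build the lookup table reached via `prefix`.

    Returns (bits, table), where `table` maps every `bits`-wide index to
    (symbol, number of bits actually consumed after the prefix).
    """
    # What is left of each code once the shared prefix has been read.
    suffixes = {sym: code[len(prefix):] for sym, code in codes.items()
                if code.startswith(prefix)}
    bits = max(len(s) for s in suffixes.values())  # width of the table
    table = {}
    for sym, suffix in suffixes.items():
        pad = bits - len(suffix)
        # A suffix shorter than the table width fills every slot that
        # starts with it, which is why F, G and H appear twice in table Y.
        for filler in range(1 << pad):
            tail = format(filler, "0{}b".format(pad)) if pad else ""
            table[suffix + tail] = (sym, len(suffix))
    return bits, table


bits_y, table_y = build_subtable(CODES, "111")
for index in sorted(table_y):
    sym, used = table_y[index]
    print("{}: {},{}".format(index, sym, used))
```

Calling build_subtable(CODES, "110") in the same way gives the four entries
of table X.
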
So what we have here are three tables with a total of 20 entries that had to
be constructed.  That's compared to 64 entries for a single table.  Or
compared to 16 entries for a Huffman tree (six two-entry tables and one
four-entry table).  Assuming that the code ideally represents the probability
of the symbols, it takes on average 1.25 lookups per symbol.  That's compared
to one lookup for the single table, or 1.66 lookups per symbol for the
Huffman tree.

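The entry counts and the 1.25 figure can be checked with a few lines of
arithmetic. This is only a sketch: the code lengths are those of the example
code from earlier in this document, the three-bit root table is taken from
that same example, and "ideally represents the probability" is read as a
symbol with an n-bit code occurring with probability 2**-n.

```python
from fractions import Fraction

# Code lengths of the example symbols (assumed from earlier in this document).
CODE_LENGTHS = {"A": 1, "B": 2, "C": 4, "D": 5, "E": 5,
                "F": 5, "G": 5, "H": 5, "I": 6, "J": 6}
ROOT_BITS = 3                # width of the first-level table
SUB_BITS = {"X": 2, "Y": 3}  # widths of the two second-level tables

# Entries to build: root table plus both sub-tables, versus one flat table
# wide enough for the longest code.
two_level_entries = (1 << ROOT_BITS) + sum(1 << b for b in SUB_BITS.values())
single_table_entries = 1 << max(CODE_LENGTHS.values())

# Ideal probabilities: an n-bit code occurs with probability 2**-n.
prob = {sym: Fraction(1, 1 << n) for sym, n in CODE_LENGTHS.items()}
assert sum(prob.values()) == 1  # the code is complete

# One lookup when the code fits in the root table, two otherwise.
expected_lookups = sum(p * (1 if CODE_LENGTHS[sym] <= ROOT_BITS else 2)
                       for sym, p in prob.items())

print(two_level_entries, single_table_entries, float(expected_lookups))
# prints: 20 64 1.25
```

Only A and B resolve in the root table, and together they account for three
quarters of the symbols, which is where 0.75 * 1 + 0.25 * 2 = 1.25 comes from.
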
There, I think that gives you a picture of what's going on.  For inflate, the
meaning of a particular symbol is often more than just a letter.  It can be a
byte (a "literal"), or it can be either a length or a distance, which
indicates a base value and a number of extra bits to fetch after the code;
the value of those bits is added to the base value.  Or it might be the
special end-of-block code.  The data structures created in inftrees.c try to
encode all that information compactly in the tables.

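To make the base-plus-extra-bits idea concrete, here is a small sketch for
the literal/length alphabet only. The base lengths and extra-bit counts are
the ones listed in RFC 1951 (section 3.2.5); the function name interpret and
its return format are made up for this illustration and do not reflect the
table layout that inftrees.c actually builds. Distance symbols work the same
way with their own table of bases and extra-bit counts.

```python
# Length symbols 257..285: base length and number of extra bits (RFC 1951).
LENGTH_BASE = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
               35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
LENGTH_EXTRA = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
                3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0]


def interpret(symbol, extra_value=0):
    """Hypothetical helper: say what one literal/length symbol means.

    `extra_value` is the number formed by the extra bits that follow the
    code in the stream (0 when the symbol takes no extra bits).
    """
    if symbol < 0 or symbol > 285:
        raise ValueError("not a valid literal/length symbol")
    if symbol < 256:
        return ("literal", symbol)       # a plain byte
    if symbol == 256:
        return ("end-of-block", None)    # special end-of-block code
    index = symbol - 257
    if extra_value >> LENGTH_EXTRA[index]:
        raise ValueError("extra value needs more bits than this symbol has")
    return ("length", LENGTH_BASE[index] + extra_value)


print(interpret(65))       # a literal byte
print(interpret(265, 1))   # base 11 plus one extra bit set: length 12
print(interpret(256))      # end of block
```
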

Jean-loup Gailly        Mark Adler
jloup@gzip.org          madler@alumni.caltech.edu


References:

[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data
Compression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3,
pp. 337-343.

``DEFLATE Compressed Data Format Specification'' available in
http://www.ietf.org/rfc/rfc1951.txt