ZPAQ

ZPAQ is a proposed open standard format for highly compressed data. It is designed so that new compression algorithms may be developed without breaking compatibility with older versions of ZPAQ-compliant decompression programs. ZPAQ uses a configurable context mixing algorithm (based on PAQ) and sandboxed post-processing code. ZPAQ uses a streaming data format that supports archives, single file compression, and memory to memory compression. Several ZPAQ compatible compression programs and configurations are available under the GPL license. The format is believed to be unencumbered by patents.

Overview

Initial work on ZPAQ was begun by Matt Mahoney on Feb. 15, 2009. After a series of incompatible experimental versions, the level 1 standard was finalized on Mar. 12, 2009. The standard consists of a document describing the format and a reference decoder (unzpaq) written in C++ and released under GPL. The standard does not define a compression algorithm; rather, it specifies that any ZPAQ level 1 compliant decoder must be able to read the output of any other ZPAQ level 1 compliant compressor. The standard has a provision for higher levels, which must be able to read the output of lower level compliant compressors but not vice versa. However, no higher levels have been defined yet.

The following software was released under GPL:

  • unzpaq - the reference decoder.
  • zpaq - an archiver and environment for developing new compression algorithms, which are described in configuration files and external preprocessors.
  • zpipe - a simple streaming compressor that reads from standard input and compresses or decompresses to standard output.
  • zp - an archiver with three built-in configurations (fast, mid, max) to trade off speed vs. size.
  • zpaqsfx - a stub for creating self-extracting archives for 32-bit Windows.

In addition, several configurations and preprocessors were published including specialized models for text, BMP images, JPEG images, x86 executable data, and pre/post processors using LZP and BWT compression. Many of these were written by Jan Ondrus.

Data format

ZPAQ uses a streaming data format. A ZPAQ stream consists of a sequence of blocks which can be decompressed independently. Blocks can be reordered, which has the effect of reordering the data they represent. Each block consists of a sequence of one or more segments, which must be decompressed in order from the beginning of the block.

Each block header describes the algorithm needed to decompress its contents and restore the original data. Each segment describes a file, part of a file, or an array of bytes. A segment contains an optional file name and an optional comment to support archives, followed by the compressed data, an end-of-data marker (4 zero bytes), and finally an optional SHA-1 checksum of the original data for integrity checking. The data is compressed using an arithmetic coder that never outputs 4 zero bytes in a row, which could be confused with an end of segment marker.

Compression algorithm

ZPAQ uses a context mixing algorithm and an optional pre/post processor. A sandboxed assembly-like language called ZPAQL is used to compute the contexts for the context mixing algorithm and to perform any required post-processing. If post-processing is used, then the ZPAQL post-processing code is compressed prior to compressing the first segment of the block. In this case, the remainder of the block after decompression is considered input to the post-processing program.

A context mixing model compresses data one bit at a time. During compression, the model estimates the probability that the next bit will be a 0 or 1. The prediction is arithmetic coded. During decompression, the model makes an identical sequence of predictions based on previously decoded data. This prediction and the compressed data are used to decode the next bit. The decoded bits are then packed 8 at a time into bytes. An extra bit after each byte is used to mark the end of the data segment. It is predicted with a very low probability.
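
As an illustration, the following C++ sketch shows a generic carry-free binary arithmetic encoder of the kind described. This is not the coder defined by the ZPAQ specification; the structure and the 16-bit probability scale are assumptions made for illustration.

#include <cstdint>
#include <vector>

// Generic binary arithmetic encoder sketch. encode() narrows the range
// [low, high] in proportion to the predicted probability of a 1 bit.
struct Encoder {
    uint32_t low = 0, high = 0xFFFFFFFF;
    std::vector<uint8_t> out;

    void encode(int bit, uint16_t p1) {  // p1 = P(bit=1) scaled to 0..65535
        uint32_t mid = low + (uint32_t)(((uint64_t)(high - low) * p1) >> 16);
        if (bit) high = mid; else low = mid + 1;
        while ((low ^ high) < (1u << 24)) {   // top byte is settled: emit it
            out.push_back((uint8_t)(high >> 24));
            low <<= 8;
            high = (high << 8) | 0xFF;
        }
    }
    void flush() {                            // pin down the final range
        for (int i = 0; i < 4; ++i) {
            out.push_back((uint8_t)(high >> 24));
            high <<= 8;
        }
    }
};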

The context mixing model is described as an array of 1 to 255 components. Each component inputs a context (a function of past data computed by the supplied ZPAQL code) and possibly the predictions of earlier components and outputs a new prediction. The output of the model is the output of the last component. ZPAQ describes 9 components.

  • CONST - The prediction is a fixed value.
  • CM (context model) - The context is mapped to a prediction using a table. Then the table entry is updated by adjusting the prediction to reduce the error. The context is also mapped to a count which is used to control the adjustment. The adjustment is inversely proportional to the count, and then the count is incremented up to a user specified limit.
  • ICM (indirect context model) - The context is mapped to a bit history (an 8 bit state), and then the bit history is mapped to a prediction and updated as with a CM. The bit history represents a count of recent zeros and ones and the value of the most recent bit.
  • MIX - The predictions of a contiguous range of components are combined by weighted averaging in the logistic domain, i.e. p = squash(Σ_i w_i stretch(p_i)), where w_i are the mixing weights, p_i are the input predictions, stretch(p) = ln(p/(1-p)), squash(x) = 1/(1 + e^-x) is the inverse of stretch, and p is the output prediction that the next bit will be a 1. The array of weights is selected by a context. On update, the weights are adjusted in proportion to the output prediction error times the input stretch(p_i); a sketch appears after this list.
  • MIX2 - A 2 input MIX from any other 2 components where the weights are constrained to add to 1.
  • AVG - A 2 input MIX with fixed weights.
  • SSE (secondary symbol estimation) - A prediction and a context are used to look up a new prediction from a table. The input prediction is first stretched and quantized to 32 levels. The output prediction is interpolated from the nearest two entries. On update, the prediction error is reduced as with a CM.
  • ISSE (indirect secondary symbol estimation) - The input is a prediction from another component and a context. The context is used to look up a bit history as with an ICM. The bit history is used to select a pair of weights for a 2 input MIX. The prediction is then combined with a CONST and updated as with a MIX.
  • MATCH - The context is used to look up the previous occurrence of the same context in a rotating history buffer, and the next bit in the buffer is predicted with a strength depending on the length of the match.
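
For example, the MIX prediction and weight update can be sketched in C++ as follows. This is a minimal sketch of the formulas above; zpaq itself defines the arithmetic exactly in fixed point, and floating point is used here only for readability.

#include <cmath>
#include <vector>

double stretch(double p) { return std::log(p / (1 - p)); }
double squash(double x)  { return 1 / (1 + std::exp(-x)); }

// Combine input predictions p[] with the weight array w[] selected by the
// current context: p_out = squash(sum of w[i] * stretch(p[i])).
double mix(const std::vector<double>& w, const std::vector<double>& p) {
    double x = 0;
    for (size_t i = 0; i < p.size(); ++i) x += w[i] * stretch(p[i]);
    return squash(x);
}

// After the actual bit is known, adjust the weights in proportion to the
// prediction error times the stretched input prediction.
void update(std::vector<double>& w, const std::vector<double>& p,
            double p_out, int bit, double rate) {
    for (size_t i = 0; i < p.size(); ++i)
        w[i] += rate * (bit - p_out) * stretch(p[i]);
}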

Configurations

The zpaq compressor reads a configuration file which describes the compression format. The contents are stored in the block header and may be read and executed by any compliant decompresser. Other compressors such as zpipe and zp use a set of built-in compression configurations instead.

Fast configuration

The following is the "fast" configuration, which is also built into the zp archiver (compression level 1). It is designed for higher speed at the expense of compression. Generally, speed depends on the number of components, in this case 2.

(ZPAQ fast configuration)
comp 1 2 0 0 2 (hh hm ph pm n)
  0 icm 16    (order 2)
  1 isse 19 0 (order 4)
hcomp
  *b=a a=0 (save in rotating buffer M)
  d=0 hash b-- hash *d=a
  d++ b-- hash b-- hash *d=a
  halt
post
  0
end

Configuration files are not case sensitive. Comments are enclosed in parentheses.

The COMP section describes the components that make up the model. The HCOMP section contains ZPAQL code which computes the contexts for the components. POST 0 indicates that there is no post-processing. The decompressed data is output directly.

The COMP section describes two components, an ICM and an ISSE. The ICM (component number 0) takes as input an order 2 context hash (depending on the last 2 bytes of context plus any previously decoded bits from the current byte) to look up a bit history. The bit history indexes a table to look up the prediction. That prediction is fed to the ISSE (component 1), which also takes an order 4 context as input. Its output prediction goes to the arithmetic decoder. The arguments to the two components (16 and 19) give the sizes of the tables as 2^16 and 2^19 entries, respectively. Each entry requires 4 bytes. Thus, the model uses a little over 2 MB of memory.
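Specifically: 2^16 entries × 4 bytes + 2^19 entries × 4 bytes = 256 KB + 2 MB = 2.25 MB.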

The HCOMP section describes the ZPAQL code which computes the context hashes for the two components. The ZPAQL machine model consists of:

  • 32-bit registers A, B, C, D, and R0 through R255.
  • One bit flag register F for comparisons and conditional jumps.
  • Array H of 32-bit values, used to store contexts for the component array.
  • Array M of 8-bit values.

The 5 arguments to COMP give the sizes of H and M for the HCOMP and POST code, respectively, and the number of components. In this case, H has 2^1 = 2 elements and M has 2^2 = 4 elements.

The ZPAQL instructions are usually 1 or 2 bytes. Most arithmetic and logic instructions except assignment operate on A, B, C, D, *B, *C, *D or an 8 bit constant and put the result in A. *B and *C refer to elements of M. *D refers to an element of H. The HASH instruction computes A := (A + *B + 512) * 773. The HCOMP code is called once per byte with the input byte in A and all other registers preserved between calls.

Thus, the code first saves the input byte A in M using B as a pointer, then computes an order 2 context hash over the last 2 bytes and stores it in H[0] (pointed to by D), then hashes in the 2 bytes before those and stores the result in H[1]. These hashes are updated only once per byte, so the bitwise contexts from the current byte are added in for each bit prediction.
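
The following C++ fragment is an illustrative rendering (not code from the zpaq distribution) of the same two per-byte context hash computations:

#include <cstdint>
#include <cstdio>

int main() {
    uint8_t  M[4] = {0};   // rotating history buffer, 2^2 bytes (hm = 2)
    uint32_t H[2] = {0};   // context hashes for components 0 and 1 (hh = 1)
    uint32_t b = 0;        // the B register; indexes M modulo 4

    // the HASH instruction: A := (A + *B + 512) * 773
    auto hash = [&](uint32_t a, uint32_t pos) {
        return (a + M[pos & 3] + 512) * 773;
    };

    for (int ch; (ch = getchar()) != EOF; ) {
        M[b & 3] = (uint8_t)ch;                  // *b=a: save the input byte
        uint32_t a = 0;
        a = hash(a, b); a = hash(a, b - 1);      // hash the last 2 bytes
        H[0] = a;                                // order 2 context for the ICM
        a = hash(a, b - 2); a = hash(a, b - 3);  // hash 2 more bytes
        H[1] = a;                                // order 4 context for the ISSE
        b++;                                     // advance the buffer pointer
    }
}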

Mid configuration

The mid level configuration is shown below. It is the default model for zpaq, zp, and zpipe. It provides a fairly good level of compression for most data at some cost in speed.

comp 3 3 0 0 8 (hh hm ph pm n)
  0 icm 5        (order 0...5 chain)
  1 isse 13 0
  2 isse $1+17 1
  3 isse $1+18 2
  4 isse $1+18 3
  5 isse $1+19 4
  6 match $1+22 $1+24  (order 7)
  7 mix 16 0 7 24 255  (order 1)
hcomp
  c++ *c=a b=c a=0 (save in rotating buffer M)
  d= 1 hash *d=a   (orders 1...5 for isse)
  b-- d++ hash *d=a
  b-- d++ hash *d=a
  b-- d++ hash *d=a
  b-- d++ hash *d=a
  b-- d++ hash b-- hash *d=a (order 7 for match)
  d++ a=*c a<<= 8 *d=a       (order 1 for mix)
  halt
post
  0
end

The compression model is similar to the one used in PAQ9A. It uses an order 0 ICM, a chain of ISSE of orders 1 through 5, and an order 7 MATCH. All of these predictions are combined using a MIX with an order 1 context. The notation $1 means that an argument may be passed to the configuration file; this argument is added to the values shown. It has the effect of doubling memory usage for each increment, because it selects the ISSE table sizes and the MATCH index table and buffer sizes. The default uses 111 MB of memory. The arguments to MIX are the table size (2^16), the range of inputs (7 components starting at 0), the learning rate (24), and a bit mask (255) which tells it to include the bits of the current byte when selecting the weight array. The second argument to each ISSE selects the previous component's prediction as input.
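
To make the ISSE chain concrete, the following sketch shows one ISSE step in C++. The table size, initial weights, and learning rate are illustrative assumptions; zpaq defines the arithmetic exactly in fixed point and trains the bit history tables as described earlier.

#include <cmath>

struct ISSE {
    double w[256][2];                     // one weight pair per bit history
    ISSE() { for (auto& pair : w) { pair[0] = 1.0; pair[1] = 0.0; } }

    // history: the 8-bit bit history selected by the context (as in an ICM)
    // p_in: the previous component's prediction
    double predict(int history, double p_in) const {
        double st = std::log(p_in / (1 - p_in));        // stretch(p_in)
        double x = w[history][0] * st + w[history][1];  // mix with a constant
        return 1 / (1 + std::exp(-x));                  // squash
    }
    void update(int history, double p_in, double p_out, int bit,
                double rate) {
        double err = rate * (bit - p_out);
        w[history][0] += err * std::log(p_in / (1 - p_in));
        w[history][1] += err;   // constant input; its stretched value is 1 here
    }
};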

Max configuration

The following configuration is the maximum compression level for zp.

comp 5 9 0 0 22 (hh hm ph pm n)
  0 const 160
  1 icm 5  (orders 0-6)
  2 isse 13 1 (sizebits j)
  3 isse $1+16 2
  4 isse $1+18 3
  5 isse $1+19 4
  6 isse $1+19 5
  7 isse $1+20 6
  8 match $1+22 $1+24
  9 icm $1+17 (order 0 word)
  10 isse $1+19 9 (order 1 word)
  11 icm 13 (sparse with gaps 1-3)
  12 icm 13
  13 icm 13
  14 icm 14 (pic)
  15 mix 16 0 15 24 255 (mix orders 1 and 0)
  16 mix 8 0 16 10 255 (including last mixer)
  17 mix2 0 15 16 24 0
  18 sse 8 17 32 255 (order 0)
  19 mix2 8 17 18 16 255
  20 sse 16 19 32 255 (order 1)
  21 mix2 0 19 20 16 0
hcomp
  c++ *c=a b=c a=0 (save in rotating buffer)
  d= 2 hash *d=a b-- (orders 1,2,3,4,5,7)
  d++ hash *d=a b--
  d++ hash *d=a b--
  d++ hash *d=a b--
  d++ hash *d=a b--
  d++ hash b-- hash *d=a b--
  d++ hash *d=a b-- (match, order 8)
  d++ a=*c a&~ 32 (lowercase words)
  a> 64 if
    a< 91 if (if a-z)
      d++ hashd d-- (update order 1 word hash)
      *d<>a a+=*d a*= 20 *d=a (order 0 word hash)
      jmp 9
    endif
  endif
  (else not a letter)
    a=*d a== 0 ifnot (move word order 0 to 1)
      d++ *d=a d--
    endif
    *d=0  (clear order 0 word hash)
  (end else)
  d++
  d++ b=c b-- a=0 hash *d=a (sparse 2)
  d++ b-- a=0 hash *d=a (sparse 3)
  d++ b-- a=0 hash *d=a (sparse 4)
  d++ a=b a-= 212 b=a a=0 hash
    *d=a b<>a a-= 216 b<>a a=*b a&= 60 hashd (pic)
  d++ a=*c a<<= 9 *d=a (mix)
  d++
  d++
  d++ d++
  d++ *d=a (sse)
  halt
post
  0
end

Components 1 through 8 are an ICM-ISSE chain similar to the mid configuration. Components 9 and 10 are order 0 and 1 word level contexts which improve text compression. The corresponding ZPAQL code folds letters to a single case by clearing bit 5 (mapping a-z onto A-Z), then computes a context hash of bytes that fall in the range 65 to 90 (the ASCII uppercase letters). When a non-letter is found, the order 0 word hash replaces the order 1 hash and is then cleared. These two components form their own ICM-ISSE chain with contexts in increasing order as before.
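
An illustrative C++ rendering of these word hash updates follows. The constants 20, 512, and 773 are taken from the configuration above; everything else is simplified.

#include <cstdint>

uint32_t word0 = 0, word1 = 0;  // order 0 and order 1 word context hashes

void update_word_contexts(uint8_t c) {
    uint8_t a = c & ~32;                   // fold case: maps a-z onto A-Z
    if (a >= 65 && a <= 90) {              // a letter
        word1 = (word1 + a + 512) * 773;   // extend the order 1 hash (HASHD)
        word0 = (word0 + a) * 20;          // extend the order 0 hash
    } else {                               // word boundary
        if (word0 != 0) word1 = word0;     // completed word becomes order 1
        word0 = 0;                         // clear the order 0 word hash
    }
}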

Components 11 through 13 are sparse order 1 contexts formed from the current byte plus one preceding byte with a gap of 1 to 3 bytes. These models are not chained. Sparse models can improve compression of some binary files.

Component 14 is specialized for CCITT fax images such as the file PIC in the Calgary corpus. These files encode a bitmapped image 1728 pixels or 216 bytes per line. It obtains a context hash of the corresponding bytes in the previous two scan lines.

Components 15 and 16 mix all previous predictions using an order 1 and order 0 context respectively. Component 17 mixes their outputs. This prediction is then passed through 2 SSE stages (components 18 and 20) with order 0 and 1 contexts respectively. For each stage, another mixer combines the input and output, which can sometimes improve compression further.

LZP (min) configuration

The following configuration (min.cfg) is part of the zpaq distribution intended for minimal but fast compression. It uses LZP preprocessing followed by a simple order 2 context model. Both the compressor and decompresser maintain a hash table that maps high order context hashes to the last occurrence in a rotating history buffer. If the text that follows matches, then that text is removed and replaced by a two byte code consisting of an escape character and the length of the match. The effect is to remove duplicate instances of long matches in the input. For example, the string "compression and decompression" might be encoded as "compression and decom,ESC,8" to tell the decompresser to copy the 8 bytes that followed the last instance of the order 3 context "com".

If the ESC byte occurs in the input, then it is encoded as ESC,0. Normally the ESC byte is chosen so that the need for such encodings is rare. Additional parameters to LZP are the sizes of the context hash table and history buffer, the order (length) of the context, and the minimum match length.
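
For illustration, a hypothetical C++ encoder following this description might look as follows. The actual preprocessor is the external program lzppre; the function name and details here are illustrative, with parameter values matching min.cfg at $1 = $2 = 0.

#include <cstdint>
#include <string>
#include <vector>

std::string lzp_encode(const std::string& in) {
    const int PH = 18, PM = 20;                 // table and buffer size bits
    const uint32_t HMASK = (1u << PH) - 1, BMASK = (1u << PM) - 1;
    const uint8_t ESC = 127;
    const int MINLEN = 3;                       // shortest coded match
    const uint32_t HMUL = 96;
    std::vector<uint32_t> table(1u << PH, 0);   // context hash -> position
    std::vector<uint8_t> buf(1u << PM, 0);      // rotating history buffer
    uint32_t h = 0, pos = 0;
    std::string out;
    auto put = [&](uint8_t c) {                 // append one byte to history
        table[h] = pos;
        buf[pos++ & BMASK] = c;
        h = ((h + c) * HMUL) & HMASK;           // rolling context hash
    };
    for (size_t i = 0; i < in.size(); ) {
        uint32_t match = table[h];              // last occurrence of context
        int len = 0;
        while (len < 255 + MINLEN - 1 && i + len < in.size()
               && buf[(match + len) & BMASK] == (uint8_t)in[i + len]) ++len;
        if (len >= MINLEN) {                    // code the match as ESC,LEN
            out += (char)ESC;
            out += (char)(len - MINLEN + 1);
            while (len--) put((uint8_t)in[i++]);
        } else {                                // literal byte
            uint8_t c = (uint8_t)in[i++];
            out += (char)c;
            if (c == ESC) out += '\0';          // ESC in input codes as ESC,0
            put(c);
        }
    }
    return out;
}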

ZPAQ only defines a decompression format. Thus, a compressor would be responsible for pre-processing the input using the LZP algorithm, compressing it, and appending the post-processing code in ZPAQL to the archive header. When used with the zpaq compressor, the pre-processor is an external program written in C++ which is called by the compressor to create a temporary file. The post-processing code is shown below.

(zpaq 1.07 minimum (fast) compression.
Uses 4 x 2^$1 MB memory. $2 increases minimum match length.
Requires lzppre as an external preprocessor.

(C) 2009, Ocarina Networks Inc. Written by Matt Mahoney.
This software is free under GPL v3.
http://www.gnu.org/copyleft/gpl.html )

comp 3 3 $1+18 $1+20 1 (hh hm PH PM n)
  0 cm $1+19 5 (context model size=2^19, limit=5*4)
hcomp
  *d<>a a^=*d a<<= 8 *d=a (order 2 context)
  halt
pcomp lzppre $1+18 $1+20 127 $2+2 96 ;
  (lzppre PH PM ESC MINLEN HMUL)
  (If you change these values, then change them in the code too)

  (The sequence ESC 0 codes for ESC. The sequence ESC LEN
   codes for a match of length LEN+MINLEN at the last place
   in the output buffer M (size 2^PM) that had the same context
   hash in the low PH bits of D. D indexes hash table H
   which points into buffer M, which contains B bytes.
   When called, A contains the byte to be decoded and F=true
   if the last byte was ESC. The rolling context hash D is
   updated by D=D*HMUL+M[B])

  if (last byte was ESC then copy from match)
    a> 0 jf 50 (goto output esc)
    a+= $2+2 (MINLEN)
    r=a 0 (save length in R0)
    c=*d (c points to match)
    do (find match and output it)
      *d=b (update index with last output byte)
      a=*c *b=a b++ c++ out (copy and output matching byte)
      d<>a a*= 96 (HMUL)
      a+=d d<>a (update context hash)
      a=r 0 a-- r=a 0 (decrement length)
    a> 0 while (repeat until length is 0)
    halt
  endif

  (otherwise, set F for ESC)
  a== 127 (ESC) if
    halt
  endif

  (reset state at EOF)
  a> 255 if
    b=0 c=0 a= 1 a<<= $1+18 d=a
    do
      d-- *d=0 a=d (clear index)
    a> 0 while
    halt (F=0)

    (goto here: output esc)
    a= 127 (ESC)
  endif
  *d=b (update index)
  *b=a b++ out (update buffer and output)
  d<>a a*= 96 (HMUL)
  a+=d d<>a (update context hash)
  halt
end

The COMP and HCOMP sections describe a simple context model with one component. Other low-order models could be substituted here for better compression at the cost of speed. The purpose of the LZP preprocessor is to compress and remove high order statistics so that a fast, low memory, low order model can be used.

The PCOMP section describes the post-processing code. The external program "lzppre" takes the 5 arguments shown in addition to the names of the input and output (temporary) files. These arguments are the sizes of the context hash table and history buffer (as powers of 2, respectively 2^18 and 2^20), the value of the ESC byte (127), the match length offset (2), and the hash multiplier (HMUL = 96). These values must match the corresponding values in the ZPAQL post-processor for correct decoding. The match length offset is added to the encoded length (1 to 255) to represent matches of length 3 to 257. The compressor arguments $1 and $2 can be passed by the user to control memory allocation and the minimum match length respectively. These values are passed to both the preprocessor and the embedded post-processing code.

The hash function is HASH ← (HASH + A) * HMUL (mod 2^18) (depending on the context hash table size), where A is the next input byte. If HMUL = 96 = 3 * 2^5, then the high 5 bits of the hash are shifted out for each byte. Thus HASH depends on only the last 4 bytes of input, effectively computing an order 4 context hash.

In the ZPAQL code, M is used as a history buffer and H is used as the context hash table. The flag bit F is used to encode whether the last input byte was an ESC. The C register is used as a pointer into M to where the next byte will be stored. D is the current context hash, which points into H. The OUT instruction outputs the contents of A to the decompressed output file. On input, A is the decoded byte from the decompresser (prior to LZP decoding), or 2^32 - 1 at the end of a segment. The program is called once for each input byte.

Burrows-Wheeler transform (bwt_j3)

The configuration bwt_j3 (http://mattmahoney.net/dc/bwt_j3.zip) by Jan Ondrus pre-processes the input with a Burrows-Wheeler transform (BWT). The input is sorted by context, which removes any high order statistics. The post-processor performs an inverse transform. The transformed data is compressed using a fast adapting order 0-1-2-4 ICM-ISSE chain followed by an order 0 MIX and an order 0 SSE adaptively bypassed by a MIX2. The model is based on the BBB compressor (http://mattmahoney.net/dc/#bbb).
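
For reference, a naive forward transform can be sketched in C++ as below. This is illustrative only: it sorts full rotations in O(n^2 log n) time, whereas a practical implementation such as bwt_j3 uses an efficient suffix sort, and the handling of the primary index is an assumption of this sketch.

#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Returns the last column of the sorted rotation matrix of s, plus the
// row index of the original string (needed by the inverse transform).
std::pair<std::string, int> bwt(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> rot(n);
    std::iota(rot.begin(), rot.end(), 0);       // rotation start offsets
    std::sort(rot.begin(), rot.end(), [&](int a, int b) {
        for (int k = 0; k < n; ++k) {           // compare rotations bytewise
            if (s[(a + k) % n] != s[(b + k) % n])
                return s[(a + k) % n] < s[(b + k) % n];
        }
        return false;
    });
    std::string last(n, '\0');
    int primary = 0;
    for (int i = 0; i < n; ++i) {
        if (rot[i] == 0) primary = i;           // row holding the original
        last[i] = s[(rot[i] + n - 1) % n];      // last column character
    }
    return {last, primary};
}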

BWT normally divides the input into blocks and uses 5 times the block size in memory to compress and decompress. The configuration bwt_slowmodel (http://mattmahoney.net/dc/bwt_slowmode1.zip) uses a slower but more memory efficient algorithm modeled on a similar algorithm used in BBB. Blocks are compressed using 1.25 times the block size in memory by sorting smaller sub-blocks to temporary files and then merging them. The inverse transform, instead of building and traversing a linked list, uses a smaller index to find the approximate location of the next decoded byte, followed by a final linear search. By conserving memory, larger blocks can be used, which improves compression.