Canonical S-expressions

From Wikipedia, the free encyclopedia

A canonical S-expression (or csexp) is a binary encoding of a subset of general S-expressions. It was designed for use in SPKI to retain the power of S-expressions and to guarantee a canonical form for applications such as digital signatures, while achieving the compactness of a binary form and maximizing parsing speed.

The particular subset of general S-expressions applicable here is composed of atoms, which are byte strings, and parentheses used to delimit lists or sub-lists. These S-expressions are fully recursive.

S-expressions are typically encoded as text, with spaces delimiting atoms and quotation marks surrounding atoms that contain spaces. In the canonical encoding, each atom is instead encoded as a length-prefixed byte string: the length of the atom in bytes, written as an ASCII decimal number, followed by a ":" and then the bytes of the atom. No whitespace is permitted between adjacent elements of a list.
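The encoding rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `encode` and the choice to model atoms as `bytes` and lists as Python lists are assumptions made for the example.

```python
def encode(node):
    """Encode a tree of bytes atoms and (nested) lists as a csexp."""
    if isinstance(node, bytes):
        # Atom: ASCII decimal length, a colon, then the raw bytes.
        return str(len(node)).encode("ascii") + b":" + node
    if isinstance(node, list):
        # List: parentheses around the concatenated elements, no whitespace.
        return b"(" + b"".join(encode(child) for child in node) + b")"
    raise TypeError(f"expected bytes or list, got {type(node).__name__}")


# A nested list containing an empty atom and an atom with a space in it.
print(encode([b"a", [b"bc", b""], b"d e"]))  # b'(1:a(2:bc0:)3:d e)'
```

Because an atom is written in exactly one way and nothing is inserted between elements, the same tree always produces the same bytes.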

Example

The sexp

(this "Canonical S-expression" has 5 atoms)

becomes the csexp

(4:this22:Canonical S-expression3:has1:55:atoms)

Note that no quotation marks are required to escape the space character inside the atom "Canonical S-expression", because the length prefix determines exactly where the atom ends. Note also that no whitespace separates an atom from the next element in the list.
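Reading the example back shows why the length prefix makes parsing simple. The following is a minimal sketch of a recursive reader; the names `decode` and `_parse` are illustrative, and the rejection of lengths with leading zeros is an assumption made here to keep the reader strict about a single spelling per atom.

```python
def decode(data):
    """Parse exactly one csexp from `data` and return it as a tree."""
    node, pos = _parse(data, 0)
    if pos != len(data):
        raise ValueError("trailing bytes after expression")
    return node


def _parse(data, pos):
    if pos >= len(data):
        raise ValueError("unexpected end of input")
    if data[pos:pos + 1] == b"(":
        pos += 1
        items = []
        while data[pos:pos + 1] != b")":
            if pos >= len(data):
                raise ValueError("unterminated list")
            item, pos = _parse(data, pos)
            items.append(item)
        return items, pos + 1  # step over the closing parenthesis
    # Otherwise an atom: decimal digits, a colon, then that many bytes.
    colon = data.find(b":", pos)
    digits = data[pos:colon] if colon != -1 else b""
    if not digits.isdigit():
        raise ValueError("expected a decimal length followed by ':'")
    if len(digits) > 1 and digits.startswith(b"0"):
        raise ValueError("leading zero in length")  # strictness assumed here
    start = colon + 1
    end = start + int(digits)
    if end > len(data):
        raise ValueError("atom runs past end of input")
    return data[start:end], end


tree = decode(b"(4:this22:Canonical S-expression3:has1:55:atoms)")
print(tree)  # [b'this', b'Canonical S-expression', b'has', b'5', b'atoms']
```

The reader never inspects the bytes inside an atom, which is why the embedded space needs no quoting.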

Properties

  • Uniqueness of canonical encoding: Forbidding whitespace between list elements and providing just one way of encoding atoms ensures that every S-expression has exactly one encoded form. Since that encoded form is itself a sequence of bytes, hashing it gives every S-expression a single, well-defined hash value. Furthermore, two S-expressions are equivalent exactly when their encodings are byte-for-byte identical.
  • Support for binary data: Atoms can be any binary string. So, a cryptographic hash value or a public key modulus that would otherwise have to be encoded in base64 or some other printable encoding can be expressed in csexp as its binary bytes.
  • Support for type-tagging encoded information: A csexp includes a non-S-expression construct for indicating the encoding of a string when that encoding is not obvious. Any atom in a csexp can be prefixed by a single atom in square brackets, such as "[4:JPEG]" or "[24:text/plain;charset=utf-8]".
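The three properties above can be illustrated together. The sketch below is an assumption-laden example rather than a standard API: `Hinted` is a hypothetical container for an atom with a bracketed type tag, `digest` is an illustrative helper, and SHA-256 is chosen arbitrarily as the hash function.

```python
import hashlib
from typing import NamedTuple


class Hinted(NamedTuple):
    """Hypothetical pairing of a bracketed type tag with an atom."""
    hint: bytes
    value: bytes


def atom(raw):
    return str(len(raw)).encode("ascii") + b":" + raw


def encode(node):
    if isinstance(node, Hinted):
        # The tag is itself an atom, wrapped in square brackets.
        return b"[" + atom(node.hint) + b"]" + atom(node.value)
    if isinstance(node, bytes):
        return atom(node)
    if isinstance(node, list):
        return b"(" + b"".join(encode(child) for child in node) + b")"
    raise TypeError("expected bytes, Hinted, or list")


def digest(node):
    """Hash of the canonical form (SHA-256 picked only for illustration)."""
    return hashlib.sha256(encode(node)).hexdigest()


# Binary data goes in as raw bytes; equal trees give equal encodings and hashes.
first = [b"public-key", bytes(range(4))]
second = [b"public-key", bytes([0, 1, 2, 3])]
print(encode(first))                     # b'(10:public-key4:\x00\x01\x02\x03)'
print(digest(first) == digest(second))   # True

# A tagged atom: the tag says how to interpret the bytes that follow.
print(encode(Hinted(b"text/plain;charset=utf-8", "caf\u00e9".encode("utf-8"))))
```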

Interpretation and Restrictions

While csexps generally permit empty lists, empty atoms, and so forth, certain uses of csexps impose additional restrictions. For example, csexps as used in SPKI have one limitation compared to csexps in general: every list must start with an atom, and therefore there can be no empty lists.

Typically, a list's first atom is treated as one treats an element name in XML.
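A restriction of this kind is easy to check once an expression is held as a tree. The sketch below assumes atoms are `bytes` and lists are Python lists; the helper names `is_spki_shaped` and `name`, and the sample field names, are purely illustrative.

```python
def is_spki_shaped(node):
    """True if every list in the tree is non-empty and starts with an atom."""
    if isinstance(node, bytes):
        return True
    return (len(node) > 0
            and isinstance(node[0], bytes)
            and all(is_spki_shaped(child) for child in node[1:]))


def name(node):
    """The first atom of a list, read much like an XML element name."""
    return node[0]


# Illustrative tree; the atoms are made up for the example.
cert = [b"cert", [b"issuer", b"alice"], [b"subject", b"bob"]]

print(is_spki_shaped(cert))              # True
print([name(c) for c in cert[1:]])       # [b'issuer', b'subject']
print(is_spki_shaped([]))                # False: empty list
print(is_spki_shaped([[b"a"], b"b"]))    # False: list starts with a list
```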

Comparisons to other encodings

There are other encodings in common use:

  1. XML
  2. ASN.1
  3. JSON

Generally, a csexp parser is one or two orders of magnitude smaller than a parser for either XML or ASN.1. This small size and the corresponding speed give csexp its main advantage. Beyond parsing, there are other differences.

csexp vs. XML

A csexp is roughly as expressive as XML. This is not surprising, since XML has been described as an ASCII form for S-expressions. However, csexp has no counterpart to XML attributes (within an element), so when encoding in csexp one must choose a representation for such attributes.

A csexp, like a general sexp, is fully recursive. XML, however, has limitations on recursive use of element names.

The first atom in a csexp list, the equivalent of an XML element name, can be any atom in any encoding (e.g., a JPEG, a Unicode string, a WAV file, ...). XML element names are constrained to a subset of the printable character set.

XML merges a sequence of strings within one element into a single string, while csexp allows a sequence of atoms within a list and those atoms remain separate from one another.

Finally, csexp is inherently binary while XML is printable, so binary quantities in XML must be encoded, for example using base64.
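The cost of making binary data printable can be seen by carrying the same 32 bytes both ways. This is a small sketch using Python's standard library; the element and atom name `hash` and the input bytes are arbitrary choices for the example.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET

raw = hashlib.sha256(b"example").digest()  # 32 arbitrary binary bytes

# csexp: the raw bytes sit directly behind a length prefix.
csexp = b"(4:hash" + str(len(raw)).encode("ascii") + b":" + raw + b")"

# XML: the bytes must first be turned into printable text.
elem = ET.Element("hash")
elem.text = base64.b64encode(raw).decode("ascii")
xml = ET.tostring(elem)

print(len(csexp))  # 43 bytes: 32 of payload plus 11 of framing
print(len(xml))    # 57 bytes: 44 of base64 text plus 13 of tags
```

A reader of the XML form also has to decode the base64 text before using the value, whereas the csexp atom is already the value.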

csexp vs. ASN.1

ASN.1 is a popular binary encoding form. However, it expresses only syntax (data types), not semantics. Two different structures, each a SEQUENCE of two INTEGERs, have identical representations on the wire (barring special tag choices to distinguish them). To parse an ASN.1 structure, one must tell the parser what set of structures to expect, and the parser must match the data being parsed against those structure options. This adds to the complexity of an ASN.1 parser.
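The point about identical wire representations can be made concrete with a hand-rolled DER encoding. The sketch below is deliberately tiny and is not an ASN.1 library: the helpers `der_small_int` and `der_sequence` are illustrative and handle only integers from 0 to 127 and short sequences.

```python
def der_small_int(n):
    """DER INTEGER for 0 <= n < 128: tag 0x02, length 1, one content byte."""
    if not 0 <= n < 128:
        raise ValueError("this sketch handles only 0..127")
    return bytes([0x02, 0x01, n])


def der_sequence(*parts):
    """DER SEQUENCE (tag 0x30) using the short length form only."""
    body = b"".join(parts)
    if len(body) >= 128:
        raise ValueError("this sketch handles only short sequences")
    return bytes([0x30, len(body)]) + body


# Two conceptually different structures, each a SEQUENCE of two INTEGERs.
point = der_sequence(der_small_int(3), der_small_int(4))  # meant as x, y
size = der_sequence(der_small_int(3), der_small_int(4))   # meant as width, height

print(point.hex())    # 3006020103020104
print(point == size)  # True: the bytes alone do not say which one was meant
```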

A csexp structure, like an XML document, carries its own semantics (encoded in element names), and the parser for a csexp structure does not care what structure is being parsed. Once a wire-format expression has been parsed into an internal tree form (similar to XML's DOM), the consumer of that structure can examine it for conformance to what was expected.
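The division of labour described here, a generic parser followed by a consumer that checks the result, might look as follows. The tree is assumed to be already parsed (atoms as `bytes`, lists as Python lists), and the shape `(point (x N) (y N))` and the function name `as_point` are hypothetical choices for the example.

```python
def as_point(tree):
    """Check a parsed tree against the shape (point (x N) (y N))."""
    if not (isinstance(tree, list) and len(tree) == 3 and tree[0] == b"point"):
        raise ValueError("not a point")
    fields = {}
    for child in tree[1:]:
        if not (isinstance(child, list) and len(child) == 2
                and all(isinstance(part, bytes) for part in child)):
            raise ValueError("malformed field")
        # int() accepts ASCII decimal bytes and raises ValueError otherwise.
        fields[child[0]] = int(child[1])
    if set(fields) != {b"x", b"y"}:
        raise ValueError("expected exactly the fields x and y")
    return fields[b"x"], fields[b"y"]


# Trees as a generic csexp parser might return them.
point = [b"point", [b"x", b"3"], [b"y", b"4"]]
size = [b"size", [b"width", b"3"], [b"height", b"4"]]

print(as_point(point))  # (3, 4)
try:
    as_point(size)
except ValueError as err:
    print("rejected:", err)  # rejected: not a point
```

Unlike the ASN.1 case, the two structures are distinguishable from the data itself, so the check can happen after parsing rather than during it.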
