YAML

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 83.67.217.254 (talk) at 09:12, 9 March 2008 (same as Clark Evans). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

YAML (/ˈjæməl/, rhymes with camel) is a human-readable data serialization format that takes concepts from languages such as XML, C, Python, and Perl, as well as the format for electronic mail as specified by RFC 2822. YAML was first proposed in 2001 by Clark Evans, who designed it together with Ingy döt Net and Oren Ben-Kiki.

YAML is a recursive acronym for "YAML Ain't a Markup Language". Early in its development, YAML was said to mean "Yet Another Markup Language", but the name was later reinterpreted (a backronym) to emphasize that its purpose is data-centric rather than document markup.

Features

YAML syntax is relatively straightforward and was designed to be easily mapped to data types common to most high-level languages: list, hash, and scalar.[1] Its familiar indented outline and lean appearance make it especially suited for tasks where humans are likely to view or edit data structures, such as configuration files, dumping during debugging, and document headers (e.g. the headers found on most e-mails are very close to YAML in look). Although visually well-suited for hierarchical data representation, it also has a compact syntax for relational data.[2] Its line and whitespace delimiters make it friendly to ad hoc grep/Python/Perl/Ruby operations. A major part of its accessibility comes from eschewing the use of enclosures like quotation marks, brackets, braces, and open/close-tags, which can be hard for the human eye to balance in nested hierarchies.

Examples

Sample document

Data structure hierarchy is maintained by outline indentation.

---
receipt:    Oz-Ware Purchase Invoice
date:        2007-08-06
customer:
    given:   Dorothy
    family:  Gale
   
items:
    - part_no:   A4786
      descrip:   Water Bucket (Filled)
      price:     1.47
      quantity:  4

    - part_no:   E1628
      descrip:   High Heeled "Ruby" Slippers 
      price:     100.27
      quantity:  1

bill-to:  &id001
    street: | 
            123 Tornado Alley
            Suite 16
    city:   East Westville
    state:  KS

ship-to:  *id001   

specialDelivery:  >
    Follow the Yellow Brick
    Road to the Emerald City.
    Pay no attention to the 
    man behind the curtain.
...

Notice that strings do not require enclosure in quotations. The sample document defines a hash with seven top-level keys: one of the keys, "items", contains a two-element array (or "list"), each element of which is itself a hash with four keys. Relational data and redundancy removal are displayed: the "ship-to" hash content is copied from the "bill-to" hash's content, as indicated by the anchor (&) and reference (*) labels. The specific number of spaces in the indentation is unimportant as long as the hierarchy order is maintained and parallel elements have the same left justification. Optional blank lines can be added for readability. Multiple documents can exist in a single file/stream and are separated by "---". An optional "..." can be used at the end of a file (useful for signalling an end in streamed communications without closing the pipe).
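As a rough illustration of the structure described above, the following Python sketch builds by hand the kind of native data structure a YAML loader might produce for the sample invoice. It is not the output of any particular parser: the names address and invoice are chosen here for illustration, and the exact scalar values (for example the date type or trailing newlines) vary between implementations.

```python
import datetime

# The address mapping is created once. The alias *id001 refers to the
# same node, so both keys below point at one shared object.
address = {
    "street": "123 Tornado Alley\nSuite 16\n",  # literal block keeps newlines
    "city": "East Westville",
    "state": "KS",
}

invoice = {
    "receipt": "Oz-Ware Purchase Invoice",
    "date": datetime.date(2007, 8, 6),  # assumes the loader resolves dates
    "customer": {"given": "Dorothy", "family": "Gale"},
    "items": [
        {"part_no": "A4786", "descrip": "Water Bucket (Filled)",
         "price": 1.47, "quantity": 4},
        {"part_no": "E1628", "descrip": 'High Heeled "Ruby" Slippers',
         "price": 100.27, "quantity": 1},
    ],
    "bill-to": address,
    "ship-to": address,
    # Folded block: line breaks become spaces (approximate value).
    "specialDelivery": (
        "Follow the Yellow Brick Road to the Emerald City. "
        "Pay no attention to the man behind the curtain.\n"
    ),
}

print(len(invoice))                             # number of top-level keys
print(invoice["ship-to"] is invoice["bill-to"])  # shared node, not a copy
```

Whether an alias becomes a shared object or an independent copy is a choice made by each implementation; the sketch assumes sharing.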

Language elements

Basic components of YAML

YAML offers both an indented and an "in-line" style for denoting hashes and arrays. Here is a sampler of the components.

Lists

Conventional block format uses a dash to begin a new item in a list

--- # Favorite movies
- Casablanca
- North by Northwest
- Notorious

Optional inline format is delimited by comma+space and enclosed in brackets (similar to JSON)

--- # Shopping list 
[milk, pumpkin pie, eggs, juice]

Hashes

--- # Block
name: John Smith
age: 33
--- # Inline
{name: John Smith, age: 33}

Block literals

Strings do not require quotation.

Newlines preserved
--- |
  There was a young fellow of Warwick
  Who had reason for feeling euphoric
      For he could, by election
      Have triune erection
  Ionic, Corinthian, and Doric

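By default, trailing blank lines are stripped and a single final newline is kept. Use |+ to keep the trailing blank lines. Leading whitespace is trimmed to the first line's indent. Use an explicit indentation indicator such as |8 (where 8 may be replaced by another single digit) when the text itself must begin with leading whitespace.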

Newlines folded
--- >
  Wrapped text
  will be folded
  into a single
  paragraph
  
  Blank lines denote
  paragraph breaks

By default, folding replaces each single line break inside a paragraph with one space and keeps a single final newline. Use >- when the final newline should be stripped as well.
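The folding rule can be approximated in a few lines of Python. This is a simplified sketch, not the algorithm from the YAML specification: the helper name fold is hypothetical, and it assumes evenly indented text with at most one blank line between paragraphs, ignoring the special handling of more-indented lines.

```python
def fold(block: str) -> str:
    """Approximate YAML '>' folding for evenly indented text."""
    paragraphs = []
    current = []
    for line in block.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line ends the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    # Default chomping keeps exactly one final newline.
    return "\n".join(paragraphs) + "\n"


text = """\
  Wrapped text
  will be folded
  into a single
  paragraph

  Blank lines denote
  paragraph breaks
"""
folded = fold(text)
print(repr(folded))
```

The result contains two lines of text, one per paragraph, which mirrors how the folded example above would be presented to an application.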

Hierarchical combinations of elements

Lists of hashes
- {name: John Smith, age: 33}
- name: Mary Smith
  age: 27
Hashes of lists
men: [John Smith, Bill Jones]
women:
  - Mary Smith
  - Susan Williams

Syntax

A compact cheat-sheet (actually written in YAML) as well as a full specification are available at yaml.org. The following is a synopsis of the basic elements.

  • YAML streams are encoded using the set of printable Unicode characters, either in UTF-8 or UTF-16
  • Whitespace indentation is used to denote structure; however tab characters are never allowed as indentation
  • List members are denoted by a leading hyphen ( - ) with one member per line, or enclosed in square brackets ( [ ] ) and separated by comma space ( ,   ).
  • Hashes are represented using the colon space ( :   ) in the form key: value, either one per line or enclosed in curly braces ( {   } ) and separated by comma space ( ,   ).
    • A hash key may be prefixed with a question mark ( ? ) to allow for liberal multi-word keys to be represented unambiguously.
  • Strings (scalars) are ordinarily unquoted, but may be enclosed in double-quotes ( " ), or single-quotes ( ' ).
    • Within double-quotes, special characters may be represented with C-style escape sequences starting with a backslash ( \ ).
  • Block scalars are delimited with indentation with optional modifiers to preserve ( | ) or fold ( > ) newlines
  • Multiple documents within a single stream are separated by three hyphens ( --- )
    • Three periods ( ... ) optionally end a document within a stream
  • Repeated nodes are initially denoted by an ampersand ( & ) and thereafter referenced with an asterisk ( * )
  • Comments begin with the number sign ( # ), can start anywhere on a line, and continue until the end of the line
  • Nodes may be labeled with a type or tag using one or two exclamation points ( ! or !! ) followed by a string which can be expanded into a URI.
  • YAML documents in a stream may be preceded by directives composed of a percent sign ( % ) followed by a name and space delimited parameters. Two directives are defined in YAML 1.1:
    • The %YAML directive is used to identify the version of YAML in a given document.
    • The %TAG directive is used as a shortcut for URI prefixes. These shortcuts may then be used in node type tags.

YAML requires that colons and commas used as list separators be followed by a space so that scalar values containing embedded punctuation (such as 5,280 or http://www.wikipedia.org) can generally be represented without needing to be enclosed in quotes.

Two additional sigil characters are reserved in YAML for possible future standardisation: the at sign ( @ ) and the grave accent ( ` ).

Advanced components of YAML

As discussed in a subsequent section, two features that distinguish YAML from the capabilities of other data serialization languages are Data Typing and Relational trees.

Data types

Explicit data typing is an advanced topic and seldom seen in the majority of YAML documents, since YAML autodetects simple types. Data types can be divided into three categories: core, defined, and user-defined. Core types are ones expected to exist in any parser (e.g. floats, ints, strings, lists, maps, ...). Other more advanced data types, such as binary data, are defined in the YAML specification but not supported in all implementations. Finally, YAML defines a way to extend the data type definitions locally to accommodate user-defined classes, structures or primitives (e.g. quad precision floats).

Casting data types

YAML autodetects the datatype of the entity, but sometimes one wants to cast the datatype explicitly. The most common situation is a single-word string that looks like a number, boolean or tag; it may need disambiguation, either by surrounding it with quotes or by using an explicit datatype tag.

---
a: 123                     # an integer
b: "123"                   # a string, disambiguated by quotes
c: 123.0                   # a float
d: !!float 123             # also a float via explicit data type prefixed by (!!)
e: !!str 123               # a string, disambiguated by explicit type
f: !!str Yes               # a string via explicit type
g: Yes                     # a boolean True
h: Yes we have No bananas  # a string, "Yes" and "No" disambiguated by context.
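The autodetection shown above can be sketched in Python as a small resolver. This is a deliberately simplified model under stated assumptions: the function name resolve and the patterns are illustrative, only three explicit tags are handled, and the real YAML 1.1 type repository recognizes many more forms (octal, hexadecimal, null, and so on) and is stricter about letter case.

```python
import re

# Simplified patterns; real YAML 1.1 resolution covers more forms.
BOOL_WORDS = {"yes": True, "true": True, "on": True,
              "no": False, "false": False, "off": False}
INT_RE = re.compile(r"[-+]?[0-9]+")
FLOAT_RE = re.compile(r"[-+]?[0-9]*\.[0-9]+|[-+]?[0-9]+\.[0-9]*")


def resolve(scalar: str, tag: str = ""):
    """Return a typed value for a scalar, honoring an explicit tag first."""
    if tag == "!!str":
        return scalar
    if tag == "!!float":
        return float(scalar)
    if tag == "!!int":
        return int(scalar)
    # Quoted scalars are always strings.
    if len(scalar) >= 2 and scalar[0] == scalar[-1] and scalar[0] in "\"'":
        return scalar[1:-1]
    if scalar.lower() in BOOL_WORDS:
        return BOOL_WORDS[scalar.lower()]
    if INT_RE.fullmatch(scalar):
        return int(scalar)
    if FLOAT_RE.fullmatch(scalar):
        return float(scalar)
    return scalar  # anything else stays a plain string


for text, tag in [("123", ""), ('"123"', ""), ("123.0", ""),
                  ("123", "!!float"), ("Yes", "!!str"), ("Yes", ""),
                  ("Yes we have No bananas", "")]:
    print(repr(resolve(text, tag)))
```

Checking the explicit tag before any pattern matching is what lets !!str override a value that would otherwise be read as a number or boolean.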

Other specified data types

Not every implementation of YAML has every specification-defined data type. These built-in types use a double exclamation sigil prefix (!!). Particularly interesting ones not shown here are sets, ordered maps, timestamps, and hexadecimal. Here is an example of binary data.

---
picture: !!binary |
 R0lGODlhDAAMAIQAAP//9/X
 17unp5WZmZgAAAOfn515eXv
 Pz7Y6OjuDg4J+fn5OTk6enp
 56enmleECcgggoBADs=
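A loader that supports !!binary hands the application the decoded bytes rather than the base64 text. The Python sketch below decodes the payload by hand with the standard library; it assumes the payload ends at the = padding character, and the variable names are illustrative only.

```python
import base64

# The block scalar's lines, joined without the line breaks.
payload = (
    "R0lGODlhDAAMAIQAAP//9/X"
    "17unp5WZmZgAAAOfn515eXv"
    "Pz7Y6OjuDg4J+fn5OTk6enp"
    "56enmleECcgggoBADs="
)

image_bytes = base64.b64decode(payload)
print(image_bytes[:6])  # the file signature at the start of the data
```

The leading bytes form a GIF signature, which suggests the example encodes a small image.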

Extension for user-defined data types

Many implementations of YAML can support user-defined data types. This is a good way to serialize an object. Local data types are not universal data types but are defined in the application using the YAML parser library. Local data types use a single exclamation mark (!).

---
myObject:  !myClass { name: Joe, age: 15}
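One way an application might connect a local tag to its own class is a lookup table from tag names to constructors. The Python sketch below is hypothetical throughout: MyClass, LOCAL_TAGS and construct are names invented for illustration and do not correspond to the API of any specific YAML library.

```python
from dataclasses import dataclass


@dataclass
class MyClass:
    name: str
    age: int


# Hypothetical registry mapping local tags to constructors.
LOCAL_TAGS = {"!myClass": lambda mapping: MyClass(**mapping)}


def construct(tag: str, mapping: dict):
    """Build an application object for a tagged mapping node."""
    try:
        factory = LOCAL_TAGS[tag]
    except KeyError:
        raise ValueError(f"no constructor registered for tag {tag}") from None
    return factory(mapping)


# The mapping is what a parser would produce for { name: Joe, age: 15}.
my_object = construct("!myClass", {"name": "Joe", "age": 15})
print(my_object)
```

Rejecting unregistered tags, as the sketch does, keeps a document from instantiating classes the application did not intend to expose.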

Relational trees

Data merge and references

Data merges and references are another advanced, less common topic. For clarity, compactness, and the avoidance of data entry errors, YAML provides node references (*) and hash merges (<<) that refer to a node labeled with an anchor (&). References branch the tree to the anchor and work for all data types (see the ship-to reference in the example above). Merges are for hashes only, and merge the keys at the anchor into the referring hash.

Merges and references are automatically expanded by the parser when the data structure is instantiated. This can greatly enhance readability and facilitate editing: below is an example of a queue in an instrument sequencer in which each subsequent step only lists the elements that are changed from the first step. When a YAML parser loads this array, all the "step" hashes will contain the five keys specified in the first step, with any overrides and additions applied.

# sequencer protocols for Laser eye surgery
---
- step:  &id001                  # defines anchor label &id001
    instrument:      Lasik 2000
    pulseEnergy:     5.4
    pulseDuration:   12
    repetition:      1000
    spotSize:        1mm

- step:
     <<: *id001                  # merges key:value pairs defined in step1 anchor
     spotSize:       2mm         # overrides "spotSize" key's value

- step:
     <<: *id001                  # merges key:value pairs defined in step1 anchor
     pulseEnergy:    500.0       # overrides key
     alert: >                    # adds additional key
           warn patient of 
           audible pop
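The effect of the merge key can be imitated with ordinary dictionary operations. The Python sketch below is an illustration of the semantics described above, not a parser: the helper name merge is hypothetical, and the folded alert text is written out approximately.

```python
base_step = {
    "instrument": "Lasik 2000",
    "pulseEnergy": 5.4,
    "pulseDuration": 12,
    "repetition": 1000,
    "spotSize": "1mm",
}


def merge(anchor: dict, local: dict) -> dict:
    """Emulate '<<': copy the anchored keys, then let local keys win."""
    return {**anchor, **local}


steps = [
    {"step": base_step},
    {"step": merge(base_step, {"spotSize": "2mm"})},
    {"step": merge(base_step, {"pulseEnergy": 500.0,
                               "alert": "warn patient of audible pop\n"})},
]

for entry in steps:
    print(sorted(entry["step"]))
```

Because the merge builds a new dictionary, overriding spotSize in the second step leaves the anchored first step unchanged.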

Comparison to other data structure format languages

YAML shares similarities with JSON, XML and SDL, so these are natural points of comparison. It also has characteristics that many similar format languages lack.

JSON

JSON syntax is nearly[3] a subset of YAML and most JSON documents can be parsed by a YAML parser.[4] This is because JSON's semantic structure is equivalent to the optional "inline-style" of writing YAML. While extended hierarchies can be written in inline-style like JSON, this is not a recommended YAML style except when it aids clarity. YAML has additional features lacking in JSON such as extensible data types, relational anchors, strings without quotation marks, and mapping types preserving key order.

XML and SDL

YAML lacks the notion of tag attributes that are found in XML and SDL. For data structure serialization, tag attributes are, arguably, a feature of questionable utility since the separation of data and meta-data adds complexity when represented by the natural data structures (hashes, arrays) in common languages. [5] Instead YAML has extensible type declarations (including class types for objects). YAML itself does not have XML's language-defined document schema descriptors that allow, for example, a document to self validate. However, a YAML schema descriptor language exists, and YAXML, which represents YAML data structures in XML, allows XML schema importers and output mechanisms like XSLT to be applied to YAML. Moreover, in typical use, the semantics provided by rich language-defined type-declarations in the YAML document itself eliminates the need for an additional validator.

Indented delimiting

Because YAML primarily relies on outline indentation for structure, it is especially resistant to delimiter collision. YAML's insensitivity to quotes and braces in scalar values means one may embed XML, SDL, JSON or even YAML documents inside a YAML document by simply indenting them in a block literal. Conversely, to place YAML in XML or SDL content requires converting all whitespace and sigils (like <, > and &) to entity syntax. To place YAML in JSON requires quoting it and escaping all interior quotes.

---
example: HTML goes into YAML without modification
message: !xml |
        <font name='times' size=10>
         <p><i>"Three is always greater than
                two, even for large values of two"</i>
          </p><p>    --Author Unknown    </p></font>
date: 2007-06-01

Non-hierarchical data models

Unlike SDL and JSON, which can only represent data in a hierarchical model with each child node having a single parent, YAML also offers a simple relational scheme that allows repeats of identical data to be referenced from two or more points in the tree rather than entered redundantly at those points. This is similar to the IDREF facility built into XML.[6] The YAML parser then expands these references into the fully populated data structures they imply when read in, so whatever program is using the parser does not have to be aware of a relational encoding model, unlike XML processors, which do not expand references. This expansion can enhance readability while reducing data entry errors in configuration files or processing protocols where many parameters remain the same in a sequential series of records while a few vary. For example, the "ship-to" and "bill-to" records in an invoice are nearly always the same data.

Practical considerations

YAML is line oriented, and thus it is often simple to convert the unstructured output of existing programs into YAML format while having them retain much of the look of the original document. Because there are no close-tags, braces and quotation marks to balance, it is generally easy to generate well-formed YAML directly from distributed print statements within unsophisticated programs. Likewise, the white space delimiters facilitate quick-and-dirty filtering of YAML files using the line oriented commands in grep, awk, perl, ruby, and python.

In particular, unlike mark-up languages, chunks of consecutive YAML lines tend to be well-formed YAML documents themselves. This makes it very easy to write parsers that do not have to process a document in its entirety (e.g. balancing open and close-tags and navigating quoted and escaped characters) before they begin extracting specific records within. This property is particularly expedient when iterating over records in a file whose entire data structure is too large to hold in memory, or for which reconstituting the entire structure to extract one item would be prohibitively expensive.

Counterintuitively, although its indented delimiting might seem to complicate deeply nested hierarchies, YAML handles indents as small as a single space, and this may achieve better compression than markup languages. Additionally, extremely deep indentation can be avoided entirely by either: 1) reverting to "inline-style" (i.e. JSON-like format) without the indentation; or 2) using relational anchors to unwind the hierarchy to a flat form that the YAML parser will transparently reconstitute into the full data structure.

Security

YAML is purely a data representation language and thus has no executable commands.[7] This means that parsers will be (or at least should be) safe to apply to tainted data without fear of a latent command-injection security hole. For example, because JSON is native JavaScript it's tempting to use the JavaScript interpreter itself to evaluate the data structure into existence, leading to command injection holes when inadequately verified. While safe parsing is inherently possible in any data language, implementation is such a notorious pitfall that YAML's lack of an associated command language may be a relative security benefit.

Data processing and representation

The XML[8][9] and YAML specifications[10] provide very different logical models for data node representation, processing, and storage.

XML: The primary logical structures in an XML instance document are: 1) Element; and 2) Element attribute.[11] For these primary logical structures, the base XML specification does not define constraints regarding such factors as duplication of elements or the order in which they are allowed to appear.[12] In defining conformance for XML processors, the XML specification generalizes them into two types: 1) validating; and 2) non-validating.[13] The XML specification asserts no detailed definitions for an API, a processing model, or a data representation model, although several are defined in separate specifications that a user or specification implementor may choose independently. These include the Document Object Model and XQuery.

A richer model for defining valid XML content is the W3C XML Schema standard[14]. This allows for full specification of valid XML content and is supported by a wide range of open source, free and commercial processors and libraries.

YAML: The primary logical structures in a YAML instance document[15] are: 1) Scalar; 2) Sequence; and 3) Mapping.[16] The YAML specification also indicates some basic constraints that apply to these primary logical structures. For example, according to the specification, mapping keys do not have an order. In every case where node order is significant, a sequence must be used.[17]

Moreover, in defining conformance for YAML processors, the YAML specification defines two primary operations: 1) Dump; and 2) Load. All YAML-compliant processors must provide at least one of these operations, and may optionally provide both.[18] Finally, the YAML specification defines an information model or "representation graph" which must be created during processing for both Dump and Load operations, although this representation need not be made available to the user through an API.[19]

Implementations

Portability

Simple YAML files (e.g. key value pairs) are readily parsed with regular expressions without resort to a formal YAML parser. YAML emitters and parsers for many popular languages written in the pure native language itself exist, making it portable in a self-contained manner. Bindings to C-libraries also exist when speed is needed.
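As an illustration of the regular-expression approach, the following Python sketch reads a flat list of key: value lines. It is a minimal sketch under stated assumptions: PAIR_RE and parse_flat are hypothetical names, every value is returned as an unconverted string, and indented lines, comments, quoting and block scalars are simply not handled.

```python
import re

# Matches "key: value" lines; the colon must be followed by whitespace.
PAIR_RE = re.compile(r"^(?P<key>[^\s#:][^:#]*?)\s*:\s+(?P<value>.*?)\s*$")


def parse_flat(text: str) -> dict:
    """Parse a flat, unindented list of 'key: value' lines into a dict."""
    result = {}
    for line in text.splitlines():
        match = PAIR_RE.match(line)
        if match:  # separators, blank and indented lines are skipped
            result[match.group("key")] = match.group("value")
    return result


sample = """\
---
receipt:    Oz-Ware Purchase Invoice
date:        2007-08-06
url: http://www.wikipedia.org
"""
parsed = parse_flat(sample)
print(parsed)
```

Requiring whitespace after the colon is what keeps the URL value intact, in line with the colon-space rule described in the Syntax section.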

C libraries

  • libYAML: As of 2007-06, this implementation of YAML 1.1 is stable and recommended by the YAML specification authors[20] for production use (despite the 0.0.1 version number and a mild caution that the API is not barred from evolution).
  • SYCK This implementation supports most of YAML 1.0 specification and is in widespread use. It is optimized for use with higher level interpreted languages, obtaining speed by writing directly to the symbol table of the higher level language when it can. Unfortunately, as of 2005 it is no longer maintained and has some incompatibilities with the specification.

Bindings

Bindings for YAML exist for the following languages:

Pitfalls and implementation defects

  • Editors:
    • An editor mode that autoexpands tabs to spaces and displays text in a fixed-width font is recommended.
  • Strings:
    • For readability, and to avoid the need for meta-escape sequences, it is desirable to avoid quoted strings. However, this leads to a pitfall when inline strings are ambiguous single words (e.g. digits or boolean words) or when the unquoted phrase accidentally contains a YAML construct (e.g. a leading exclamation point or a colon-space after a word: "!Caca de vaca!" or "Caution: lions ahead"). This is not an issue anyone using a proper YAML emitter will confront, but it can come up in ad hoc scripts or human editing of files. What matters is whether the creator of the data structure controls the content of the unquoted inline strings. If not, then using either !!str tags or enclosing them in quotes may be a good practice. The !!str tag is preferable if the string itself may contain quotation marks. Another, simpler, approach is to use block literals ("|" or ">") rather than inline string expressions, as these have no such ambiguities to resolve.
  • Anticipating implementation idiosyncrasies:
    • Some implementations of YAML, such as Perl's YAML::BASE, will load an entire file (stream) and parse it en masse. Conversely, YAML::Tiny only reads the first document in the stream and stops. Other implementations like PyYAML are lazy and iterate over the next document only upon request. For very large files in which one plans to handle the documents independently, instantiating the entire file before processing may be prohibitive. Thus in YAML::BASE, occasionally one must chunk a file into documents and parse those individually. Fortunately, YAML makes this easy, since this simply requires splitting on the document separator, m/^---/. That strategy could be disrupted if an anchor and a reference to it happen to lie in different documents of the same file.
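The chunking strategy from the last pitfall can be sketched in Python as a generator that yields one document at a time, so only a single document is held in memory. The names SEPARATOR and iter_documents are hypothetical, and the sketch assumes that no scalar content begins a line with three hyphens.

```python
import io
import re
from typing import Iterator, TextIO

SEPARATOR = re.compile(r"^---")  # the m/^---/ pattern from the text


def iter_documents(stream: TextIO) -> Iterator[str]:
    """Yield the text of each document in a YAML stream, one at a time."""
    chunk = []
    for line in stream:
        # A separator starts a new document; emit the previous one first.
        if SEPARATOR.match(line) and chunk:
            yield "".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(chunk)


stream = io.StringIO(
    "--- # Block\n"
    "name: John Smith\n"
    "age: 33\n"
    "--- # Inline\n"
    "{name: John Smith, age: 33}\n"
)
docs = list(iter_documents(stream))
print(len(docs))
```

Each chunk keeps its own separator line, so it can be handed unchanged to whatever parser processes the individual documents.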

See also

Other simplified markup languages include:

Notes and references

  1. ^ For purposes of this article, the terms (list and array), (hash, dictionary and mapping) and (string and scalar) are used interchangeably. Such usage is a simplification and may not be correct when specifically applied to some programming languages.
  2. ^ A hierarchical model only gives a fixed, monolithic view of the tree structure. For example, either actors under movies, or movies under actors. YAML allows both using a relational model.
  3. ^ The syntax differences are subtle and seldom arise in practice: JSON allows extended character sets like UTF-32, YAML requires a space after separators like comma, equals, and colon while JSON does not, and some non-standard implementations of JSON extend the grammar to include JavaScript's /*...*/ comments. Handling such edge cases may require light pre-processing of the JSON before parsing as in-line YAML.
  4. ^ Parsing JSON with SYCK
  5. ^ In markup languages, attribute values in an open-tag must be handled separately from the data value enclosed by the tags. Typically, to hold this in a data structure means each node is an object with storage for the tag-name plus a hash for any possible named attributes and their values, and then a separate scalar for holding any enclosed data. YAML treats these even-handedly: each node is a simple type, usually a scalar, array, or hash.
  6. ^ XML IDREF
  7. ^ A proposed "yield" tag will allow for simple arithmetic calculation
  8. ^ "Extensible Markup Language (XML) 1.0 (Fourth Edition)". Retrieved 2007-11-04.
  9. ^ "Extensible Markup Language (XML) 1.1 (Second Edition)". Retrieved 2007-11-04.
  10. ^ "YAML Ain't Markup Language (YAML™) Version 1.1". Retrieved 2007-11-04.
  11. ^ http://www.w3.org/TR/xml11/#sec-logical-struct
  12. ^ Note, however, that the XML specification does define an "Element Content Model" for XML instance documents that include validity constraints. Validity constraints are user-defined and not mandatory for a well-formed XML instance document. http://www.w3.org/TR/xml11/#sec-element-content. In the case of duplicate Element attribute declarations, the first declaration is binding and later declarations are ignored [1].
  13. ^ http://www.w3.org/TR/REC-xml/#sec-conformance
  14. ^ http://www.w3.org/XML/Schema
  15. ^ The YAML specification identifies an instance document as a "Presentation" or "character stream". [2]
  16. ^ Additional, optional-use, logical structures are enumerated in the YAML types repository. "Language-Independent Types for YAML™ Version 1.1". Retrieved 2007-11-04. The tagged types in the YAML types repository are optional and therefore not essential for conformant YAML processors. "The use of these tags is not mandatory."
  17. ^ [3]
  18. ^ "Dump" and "Load" operations consist of a few sub-operations, not all of which need to be exposed to the user or through an API (see http://yaml.org/spec/current.html#id2504671).
  19. ^ http://yaml.org/spec/current.html#representation
  20. ^ YAML creator Clark Evans inserted this recommendation.