yaml-spec-1.2-comments.yaml

---
'000':
- |+

  This YAML grammar was generated from https://yaml.org/spec/1.2/spec.html by:
  https://github.com/yaml/yaml-grammar/tree/master/bin/yaml-grammar-html-to-yaml

  This grammar is a YAML file that should be loadable by any decent existing
  YAML loader. It has also been converted to JSON in a file next to this one.
  The YAML file has lots of helpful comments. The JSON (of course) does not.

  The purpose of this grammar-as-yaml file is to make the spec clearer and more
  precise to people implementing YAML parsers. The grammar in the YAML 1.2
  specification is difficult to understand and in a few cases ambiguous or
  incorrect.

  This grammar mirrors the 1.2 spec faithfully, but tries to express the rules
  in a more concrete and programmatic fashion. Some of the rules that are more
  difficult to understand, ambiguous, or incorrect have a comment section
  before them. Each rule definition is preceded by the original YAML 1.2 spec
  rule text, formatted as a comment.

  The YAML Reference Parser project (which can be found at
  https://github.com/yaml/yaml-reference-parser/) uses this file to generate
  100% correct YAML parsers in many programming languages.

- |2+

    = Syntax Guide =

  The following is an explanation of the various EBNF/DSL syntax forms used in
  this grammar. The original YAML 1.2 spec uses an EBNF form that has more than
  a couple of oddities. This file tries to make as much sense of them as
  possible.
  Each rule that introduces a new syntax is preceded by an explanation.

  There are 211 rules in the YAML 1.2 spec. This grammar file has 211 sections,
  one for each rule. The overall structure of this file is a big mapping with
  2 mapping pairs for each grammar rule. Each rule group looks like this:

    :rule-number: rule-name
    # original-spec-rule-as-a-comment

    rule-name: rule-definition

  The first pair is just for indexing the rule (by rule number) back to the 1.2
  spec. The rule definition is the important part. Complex definitions are data
  structures composed mostly of other rule names. The simplest rules point to
  single characters. This forms a top-down grammar where the top is the last
  rule, #211.
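
  As a rough illustration of the pairing described above, the sketch below
  indexes rule definitions by rule number from a mapping that has already been
  loaded. The helper name `index_rules`, the sample rule names and the
  assumption that index keys load as strings like ":001" are all hypothetical.

  ```python
  def index_rules(grammar):
      """Map each rule number to a (rule name, rule definition) tuple."""
      rules = {}
      for key, value in grammar.items():
          # Assumption: index keys load as strings such as ":001", and
          # their value is the name of the rule they point at.
          if key.startswith(":") and key[1:].isdigit():
              rules[int(key[1:])] = (value, grammar[value])
      return rules


  # Hypothetical two-rule grammar shaped like the rule groups above.
  sample = {
      ":001": "c-example",
      "c-example": {"(any)": ["x20", "x7E"]},
      ":002": "b-example",
      "b-example": "x0A",
  }
  indexed = index_rules(sample)
  ```

  Looking up `indexed[2]` then gives the name and definition of rule 2 in one
  step, which is handy when cross-checking against the spec.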


    == Rule Names ==

  A rule name consists of groups of lower case characters and numbers separated
  by dashes (`-`).

  Each rule name has a 1 or 2 character prefix indicating the rule type.

  Example rule names:

    * nb-foo-bar
    * nb+foo-bar
    * nb-foo-bar(n)
    * nb-foo-bar(n,c)

    === Rule Name Prefixes ===

  The rule type prefixes are:

    e  -- Empty (matches no characters)
    c  -- Character (match one single character)
    b  -- Break (match a single line break character)
    nb -- Non-break (match a single non-break character)
    s  -- Space (match a single whitespace character)
    ns -- Non-space (match a single non-space character)
    l  -- Line (match a complete line)

  And also (where X and Y are each one of the above prefixes):

    X-Y  -- Match starting with an X- match and ending with a Y- match
    X+   -- Match X where indentation is greater than n
    X-Y+ -- Match X-Y where indentation is greater than n


    == Rule Variables ==

  Some rules are defined with variable arguments indicating that they should be
  passed values when being used. There are 4 distinct variables used in the
  YAML 1.2 grammar.

    n -- Current indentation level. An integer indicating the number of spaces.

    m -- Additional indentation level. An integer indicating number of spaces.

    c -- Current context. String that is one of:
      * "block-in"
      * "block-out"
      * "block-key"
      * "flow-in"
      * "flow-out"
      * "flow-key"

    t -- How to treat whitespace after a literal scalar. One of:
      * "strip"
      * "clip"
      * "keep"

  Where and how these variables are set is a bit inconsistent. Some are set
  explicitly in rules that act like function calls. For example, the starting
  values of n (-1) and c ("block-in") are set in rule 207.

  The m variable is set explicitly in rules 183 and 187, albeit by an undefined
  special rule called <auto-detect-indent>. In rules like 185 it assumes that m
  is stored as a state/stack variable and has been set somewhere else.

  The t variable is set using functional constructs in rule 164, depending on
  the value of a c-chomping-indicator rule.
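
  The three values of t can be sketched as follows. This is a simplified
  illustration that works on an already-extracted scalar string and assumes
  the usual YAML 1.2 indicator characters (`-` for strip, `+` for keep,
  nothing for clip). The function names are hypothetical, not part of the
  grammar.

  ```python
  def chomping_for(indicator):
      """Pick a value for t from a chomping indicator character."""
      return {"-": "strip", "+": "keep"}.get(indicator, "clip")


  def chomp(text, t):
      """Apply t to the trailing line breaks of a block scalar's text."""
      body = text.rstrip("\n")
      if t == "strip":
          return body                  # drop every trailing line break
      if t == "clip":
          # keep one final line break, if the content had one at all
          return body + "\n" if body and text.endswith("\n") else body
      return text                      # "keep": leave them all in place
  ```

  For the text "a" followed by two line breaks, strip yields just "a", clip
  yields "a" plus one line break, and keep returns the text unchanged.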


    == Special Rule Names ==

  A handful of rule names are words inside angle brackets. These are called
  "special rules". They are not defined in the grammar. Each of these rules is
  an assertion that doesn't consume any text in the parse. They are:

    * <empty>
      Always matches.

    * <start-of-line>
      Current parse position is at the start of a line.

    * <end-of-stream>
      Current parse position is at the end of the input stream.

    * <auto-detect-indent>
      Detect the number of characters of new indentation.
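
  A minimal sketch of the first three special rules as zero-width checks. It
  assumes the input is a Python string, that `pos` is a character offset, and
  that only line feeds count as line breaks (a simplification). The function
  names are hypothetical.

  ```python
  def empty(text, pos):
      """<empty>: always matches."""
      return True


  def start_of_line(text, pos):
      """<start-of-line>: at offset 0 or just after a line feed."""
      return pos == 0 or text[pos - 1] == "\n"


  def end_of_stream(text, pos):
      """<end-of-stream>: nothing is left to parse."""
      return pos >= len(text)
  ```

  Each check only inspects the position and returns a truth value, so the
  parse position never moves.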


    == Rule Definitions ==

  The rule definitions here are expressed as YAML data structures where the
  scalar values can be:

    * A rule name
    * A literal character (always single quoted for clarity)
    * Some special DSL forms

  The various DSL forms were chosen carefully to be easy to read and
  understand. The rest of this section describes those forms.

    === Special Syntax Used ===

  There are some syntax conventions used for clarity. Everything is valid YAML.
  Certain things could be expressed in YAML in different ways. For instance, a
  single character to be matched could be unquoted, single-quoted or
  double-quoted. If the intent is to match a single character, it is single
  quoted.

    * unquoted-word         -- Usually a rule name
    * unquoted-single-char  -- A variable name
    * single-quoted-char    -- A character to match
    * double-quoted-string  -- A literal string value
    * (...)                 -- A parsing function
    * <...>                 -- A special rule name

    === Hex Codes ===

  Hex codes are scalars beginning with an `x` followed by an even number of
  uppercase hexadecimal digits. Since hex codes represent characters they are
  encoded in single quotes.

  There are hex ranges as well, which consist of a pair of hex codes.
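
  One way a parser generator might decode these forms is sketched below. The
  helper names are hypothetical, and ranges are assumed to be inclusive at
  both ends.

  ```python
  def hex_char(code):
      """Turn a hex code such as 'x41' into the character it names."""
      if not code.startswith("x"):
          raise ValueError("not a hex code: " + repr(code))
      return chr(int(code[1:], 16))


  def in_hex_range(ch, low, high):
      """True if ch lies in the range low..high (assumed inclusive)."""
      return hex_char(low) <= ch <= hex_char(high)
  ```

  With these, `in_hex_range(ch, "x20", "x7E")` tests whether `ch` is a
  printable ASCII character.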

    === Rule Functions ===

  The grammar uses these functions. The function name is always in parentheses.
  The name is a key of a single pair mapping.

    * (any)
      One of the rules in the group must match.

    * (all)
      All of the rules in the group must match.

    * (+++)
      The rule must match 1 or more times.

    * (***)
      The rule must match 0 or more times.

    * (???)
      The rule must match 0 or 1 times.

    * ({x})
      The rule must match x times, where x is an integer or a variable.

    * (===)
      Positive look-ahead assertion.

    * (!==)
      Negative look-ahead assertion.

    * (<==)
      Positive look-behind assertion.

    * (---)
      A set of characters consisting of the first set minus each of the others.

    * (...)
      The argument variables that are passed to this rule. Either a single
      value or a sequence of values.

    * (case)
      A variable indicated by `var` must match one of the other keys. The value
      of the matching key pair is the next rule to be checked.

    * (flip)
      Some grammar rules, like 136, don't attempt to match anything. They are
      functions that change a variable. Rule 136 uses the (flip) function to
      change a variable from one value to another.

    * (if)
      Check the rule indicated by the value. If it matches, call the (set)
      function that must be provided.

    * (set)
      The value is a 2 element sequence. The first element is a variable and
      the second is an expression. Set the variable to the value of the
      evaluated expression.

    * (ord)
      Convert a character (1-9) to an integer.

    * (match)
      Return the string or character value of the last rule match.

    * (len)
      Return the length of a string value.

    * (max)
      The YAML spec grammar constrains some rules to complete in the next 1024
      characters. The (max) function is used to indicate that this is the
      case. How it is implemented (like many things) is left as an exercise for
      the parser author.

    * (exclude)
      This one is a doozy. It indicates a rule that must not match during any
      of the recursive sub-matches. Since it is used near the top of the
      grammar, it affects almost everything.

    * (+)
      Return the first argument plus the second.

    * (-)
      Return the first argument minus the second.

    * (<)
      First argument is less than the second.

    * (<=)
      First argument is less than or equal to the second.
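
  The value-producing functions at the end of this list are small enough to
  sketch as a lookup table. The table and the `call` helper are hypothetical;
  they only show one possible way to evaluate these keys.

  ```python
  import operator

  # Hypothetical dispatch table for the arithmetic and comparison keys.
  FUNCTIONS = {
      "(+)": operator.add,
      "(-)": operator.sub,
      "(<)": operator.lt,
      "(<=)": operator.le,
      "(len)": len,
      "(ord)": int,      # a digit character '1'-'9' to its integer value
  }


  def call(name, *args):
      """Evaluate one of the functions above on already-evaluated args."""
      return FUNCTIONS[name](*args)
  ```

  For example, `call("(+)", n, m)` adds two indentation values, and
  `call("(ord)", "3")` turns an indentation indicator digit into a number.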

'001': |
  This is the complete set of characters allowed in a YAML stream.

  The (any) means that any one of the value group's rules must match for the
  group to match. The remaining rules are attempted in order.

  The rules beginning with `x` are hex codes. The pairs of hex codes are
  character ranges.
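
  A minimal sketch of (any) as an ordered choice. It assumes a matcher is a
  function taking the text and a position and returning the new position, or
  None when it fails. The names `char` and `any_of` are hypothetical.

  ```python
  def char(c):
      """Matcher for one literal character."""
      def match(text, pos):
          return pos + 1 if text[pos:pos + 1] == c else None
      return match


  def any_of(*rules):
      """(any): try each rule in order; the first one that matches wins."""
      def match(text, pos):
          for rule in rules:
              end = rule(text, pos)
              if end is not None:
                  return end
          return None
      return match


  # Hypothetical group: a tab or a space.
  tab_or_space = any_of(char("\t"), char(" "))
  ```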
'003': |
  Rules 003-021 define single characters. These rules are never referenced in
  the rest of the grammar. They should at least be referenced in rule 022.
'004': |
  Rules 004 to 021 are single characters to match. Instead of a hex code, they
  are specified as a single character in single quotes.
'022': |
  Rule 022 should probably be defined using the rule names above rather than
  literal characters.
'026': |
  Rules can "call" other rules by name. This is called a rule reference. A
  reference matches if its named rule matches.
'027': |
  The (---) operator here matches if the next character is in the first range,
  but not in any of the following ranges.
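
  The set subtraction can be sketched like this, assuming each range is
  available as a predicate on a single character. The name `minus` and the
  sample ranges are hypothetical and are not the ranges used by rule 027.

  ```python
  def minus(first, *others):
      """(---): match one character in `first` but in none of `others`."""
      def match(text, pos):
          ch = text[pos:pos + 1]
          if ch and first(ch) and not any(other(ch) for other in others):
              return pos + 1
          return None
      return match


  # Hypothetical example: printable ASCII minus the digits.
  def printable(ch):
      return "\x20" <= ch <= "\x7e"


  def digit(ch):
      return "0" <= ch <= "9"


  non_digit = minus(printable, digit)
  ```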
'028': |
  A rule in a group of rules can be another group. This is how parenthesized
  groups are represented.

  The (all) function matches if all the rules in its group match.
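
  A minimal sketch of (all), assuming the rules in the group are matched one
  after another, each starting where the previous one ended. Matchers return
  the new position or None. The names are hypothetical.

  ```python
  def char(c):
      """Matcher for one literal character."""
      def match(text, pos):
          return pos + 1 if text[pos:pos + 1] == c else None
      return match


  def all_of(*rules):
      """(all): every rule must match, in sequence."""
      def match(text, pos):
          for rule in rules:
              pos = rule(text, pos)
              if pos is None:
                  return None      # one failure fails the whole group
          return pos
      return match


  # Hypothetical group: a carriage return followed by a line feed.
  crlf = all_of(char("\r"), char("\n"))
  ```

  Because groups are themselves matchers, an `all_of` can be nested inside
  another group, which is how parenthesized sub-groups fall out naturally.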
'059': |
  The ({x}) is a numeric quantifier, i.e., the rule must match that exact
  number of times.
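
  All of the quantifiers can be sketched with one bounded repetition helper.
  This is a greedy sketch under the assumption that matchers return the new
  position or None. `repeat` and `hex_digit` are hypothetical names.

  ```python
  def repeat(rule, low, high=None):
      """Match `rule` between low and high times (high=None: no limit)."""
      def match(text, pos):
          count = 0
          while high is None or count < high:
              end = rule(text, pos)
              if end is None or end == pos:   # failure, or no progress
                  break
              pos, count = end, count + 1
          return pos if count >= low else None
      return match


  def hex_digit(text, pos):
      """Matcher for a single hexadecimal digit."""
      ch = text[pos:pos + 1]
      return pos + 1 if ch and ch in "0123456789ABCDEFabcdef" else None


  exactly_two = repeat(hex_digit, 2, 2)       # like ({2})
  ```

  In the same style, (+++) would be `repeat(rule, 1)`, (***) would be
  `repeat(rule, 0)` and (???) would be `repeat(rule, 0, 1)`.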
'063': |
  Some rules are intended to be called with arguments. Arguments are specified
  with the (...) key. A single argument is a scalar value. Multiple arguments
  are specified as a sequence.
'064': |
  This rule means to match as many spaces as possible, then check that the
  length of the match is less than the indentation level in variable n. The
  (<<<) means that if either assertion fails, backtrack to where the rule
  started.

  The (***) function means the rule must match 0 or more times.

  The (<) function matches if the first argument is less than the second.

  The (len) function returns the length of a string, and the (match) function
  returns the last match.
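
  Put together, the behaviour described above might look like the sketch
  below. It assumes a matcher returns the new position or None, and that
  returning None leaves the caller at the original position (the backtrack).
  The function name is hypothetical.

  ```python
  def indent_less_than(n):
      """Match a run of spaces whose length is less than n."""
      def match(text, pos):
          end = pos
          while end < len(text) and text[end] == " ":
              end += 1                   # (***): zero or more spaces
          matched = text[pos:end]        # (match): the text just consumed
          if len(matched) < n:           # (<) applied to (len)
              return end
          return None                    # fail; the caller stays at pos
      return match
  ```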
'065': |
  The (<=) function matches if the first argument is less than or equal to the
  second.
'066': |
  The (+++) function means the rule must match 1 or more times.

  The <start-of-line> rule is a "special" rule. It's not defined in the
  grammar. It's an assertion that the current parse position is on a "start of
  line" boundary.
'067': |
  A (case) statement is a mapping that chooses the next rule based on the value
  of a state variable (in this case `c`).

  7 rules in this grammar have (case) statements.
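
  A (case) can be sketched as a table lookup on the state variable. Here the
  parser state is assumed to be a plain dict and matchers take it as a third
  argument. The table contents are hypothetical; the real rule 067 refers to
  other grammar rules.

  ```python
  def literal(s):
      """Matcher for a fixed string."""
      def match(text, pos, state):
          return pos + len(s) if text.startswith(s, pos) else None
      return match


  def case(var, table):
      """(case): pick the rule stored under the current value of `var`."""
      def match(text, pos, state):
          rule = table[state[var]]
          return rule(text, pos, state)
      return match


  # Hypothetical table keyed on the context variable c.
  rule = case("c", {
      "block-in": literal("  "),
      "flow-in": literal(" "),
  })
  ```

  A value of c that is missing from the table raises a KeyError in this
  sketch, which matches the idea that the variable must match one of the keys.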
'069': |
  The (???) function means the rule must match 0 or 1 times.
'076': |
  The <end-of-stream> rule is a "special" rule that matches when there is no
  more text to parse.
'105': |
  The <empty> rule is a "special" rule that always matches (but consumes no
  text). It is used for places in YAML that allow the absence of text to
  represent an empty plain scalar.
'126': |
  The (===) function is a positive lookahead. It checks to see if a character
  is next, but doesn't consume it.
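
  A minimal sketch of the look-ahead assertions, assuming matchers return the
  new position or None. A successful assertion returns the position it was
  given, so nothing is consumed. The names are hypothetical, and (!==) is
  included for comparison.

  ```python
  def look_ahead(rule):
      """(===): succeed if `rule` matches here, without consuming text."""
      def match(text, pos):
          return pos if rule(text, pos) is not None else None
      return match


  def not_ahead(rule):
      """(!==): succeed only if `rule` does not match here."""
      def match(text, pos):
          return pos if rule(text, pos) is None else None
      return match


  def hash_sign(text, pos):
      """Matcher for a single '#' character."""
      return pos + 1 if text[pos:pos + 1] == "#" else None
  ```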
'136': |
  The (flip) function uses a mapping to change an input value to another value,
  and returns the new value.
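
  In code, (flip) amounts to a dictionary lookup. The sketch below uses a
  sample mapping that is illustrative only; the actual entries are the ones
  given in rule 136.

  ```python
  def flip(mapping, value):
      """(flip): look up `value` and return the value it changes to."""
      return mapping[value]


  # Illustrative mapping on the context variable c (hypothetical entries).
  to_flow_in = {"flow-out": "flow-in", "flow-in": "flow-in"}
  new_c = flip(to_flow_in, "flow-out")
  ```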
'154': |
  The (max) function tells the parser that it can consider only the given
  number of characters to match the next rule.
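
  One possible way to honor such a limit is to hide everything past it from
  the wrapped rule. This is only a sketch: `max_chars` and `up_to_colon` are
  hypothetical names, matchers are assumed to return the new position or None,
  and a rule that checks for the end of the stream would be fooled by the
  shortened window.

  ```python
  def max_chars(limit, rule):
      """(max): let `rule` see at most `limit` characters from pos."""
      def match(text, pos):
          window = text[:pos + limit]    # hide everything past the limit
          return rule(window, pos)
      return match


  def up_to_colon(text, pos):
      """Hypothetical rule: consume everything up to and including ':'."""
      end = text.find(":", pos)
      return end + 1 if end != -1 else None


  limited = max_chars(4, up_to_colon)
  ```

  With a limit of 4, the colon in "ab:cd" is found, while the colon in
  "abcde:" lies outside the window and the rule fails.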
'162': |
  Productions 162-182 and 199 deal with folded and literal block scalars. These
  productions are concerned with the variables:

    m - relative indentation columns or 'auto-detect()'
    t - trailing whitespace indicator
'183': |
  The <auto-detect-indent> rule is a "special" rule that finds the new level of
  indentation and returns the number of columns. The level must be greater than
  0. It does not consume the indentation.
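
  A deliberately simplified sketch of this behaviour: it counts the spaces at
  the current position and ignores details such as leading empty lines and
  the current indentation n, which a real parser has to handle. The function
  name is hypothetical.

  ```python
  def auto_detect_indent(text, pos):
      """Count the spaces at pos without consuming them."""
      end = pos
      while end < len(text) and text[end] == " ":
          end += 1
      m = end - pos
      return m if m > 0 else None    # the level must be greater than 0
  ```

  The function returns a number rather than a new position, so the caller's
  parse position stays where it was.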
'207': |
  The l-bare-document is the topmost rule for parsing a document.

  The <excluding_c-forbidden_content> says that in every other rule you must
  check that c-forbidden does not occur.

  This is a bad design choice for a formal grammar.