---
'000':
- |+
This YAML grammar was generated from https://yaml.org/spec/1.2/spec.html by:
https://github.com/yaml/yaml-grammar/tree/master/bin/yaml-grammar-html-to-yaml
This grammar is a YAML file that should be loadable by any decent existing
YAML loader. It has also been converted to JSON in a file next to this one.
The YAML file has lots of helpful comments. The JSON (of course) does not.
The purpose of this grammar-as-yaml file is to make the spec clearer and more
precise to people implementing YAML parsers. The grammar in the YAML 1.2
specification is difficult to understand and in a few cases ambiguous or
incorrect.
This grammar mirrors the 1.2 spec faithfully, but tries to express the rules
in a more concrete and programmatic fashion. Some of the rules that are more
difficult to understand, ambiguous or incorrect, have a comment section
before them. Each rule definition is preceded by the original YAML 1.2 spec
rule text, formatted as a comment.
The YAML Reference Parser project (which can be found at
https://github.com/yaml/yaml-reference-parser/) uses this file to generate
100% correct YAML parsers in many programming languages.
- |2+
= Syntax Guide =
The following is an explanation of the various EBNF/DSL syntax forms used in
this grammar. The original YAML 1.2 spec uses an EBNF form that has more than
a couple of oddities. This file tries to make as much sense of them as possible.
Each rule that introduces a new syntax is preceded by an explanation.
There are 211 rules in the YAML 1.2 spec. This grammar file has 211 sections,
one for each rule. The overall structure of this file is a big mapping with
2 mapping pairs for each grammar rule. Each rule group looks like this:
:rule-number: rule-name
# original-spec-rule-as-a-comment
rule-name: rule-definition
The first pair is just for indexing the rule (by rule number) back to the 1.2
spec. The rule definition is the important part. Complex definitions are data
structures composed mostly of other rule names. The simplest rules point to
single characters. This forms a top-down grammar where the top is the last
rule, #211.
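As a rough illustration of the two-pairs-per-rule layout described above, the
following Python sketch splits an already-loaded grammar mapping into a
number-to-name index and a name-to-definition table. It assumes the index keys
load as strings such as ":001". The function name, the sample mapping and its
two rule names are hypothetical placeholders, not entries copied from the real
grammar.

```python
def split_grammar(grammar):
    """Separate index pairs (":NNN" -> rule name) from definition pairs."""
    index = {}        # rule number (int) -> rule name
    definitions = {}  # rule name -> rule definition
    for key, value in grammar.items():
        if key.startswith(":"):
            index[int(key.strip(":"))] = value
        else:
            definitions[key] = value
    return index, definitions


# Hypothetical two-rule grammar, shaped like the layout described above.
sample = {
    ":001": "c-first",
    "c-first": "a",
    ":002": "nb-second",
    "nb-second": {"(+++)": "c-first"},
}

index, definitions = split_grammar(sample)
top_rule = index[max(index)]  # the highest-numbered rule is the top
```

Looking up the highest rule number gives the entry point of the top-down
grammar, which in the real file would be rule 211.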
== Rule Names ==
A rule name consists of groups of lower case characters and numbers separated
by dashes (`-`).
Each rule name has a 1 or 2 character prefix indicating the rule type.
Example rule names:
* nb-foo-bar
* nb+foo-bar
* nb-foo-bar(n)
* nb-foo-bar(n,c)
=== Rule Name Prefixes ===
The rule type prefixes are:
e -- Empty (matches no characters)
c -- Character (match a single character)
b -- Break (match a single line break character)
nb -- Non-break (match a single non-break character)
s -- Space (match a single whitespace character)
ns -- Non-space (match a single non-space character)
l -- Line (match a complete line)
And also (where X and Y are each one of the above prefixes):
X-Y -- Match starting with an X- match and ending with a Y- match
X+ -- Match X where indentation is greater than n
X-Y+ -- Match X-Y where indentation is greater than n
== Rule Variables ==
Some rules are defined with variable arguments indicating that they should be
passed values when being used. There are 4 distinct variables used in the
YAML 1.2 grammar.
n -- Current indentation level. An integer indicating the number of spaces.
m -- Additional indentation level. An integer indicating number of spaces.
c -- Current context. String that is one of:
* "block-in"
* "block-out"
* "block-key"
* "flow-in"
* "flow-out"
* "flow-key"
t -- How to treat whitespace after a literal scalar. One of:
* "strip"
* "clip"
* "keep"
Where and how these variables are set is a bit inconsistent. Some are set
explicitly in rules that act like function calls. For example, the starting
values of n (-1) and c ("block-in") are set in rule 207.
The m variable is set explicitly in rules 183 and 187, albeit by an undefined
special rule called <auto-detect-indent>. In rules like 185 it assumes that m
is stored as a state/stack variable and has been set somewhere else.
The t variable is set using functional constructs in rule 164, depending on
the value of a c-chomping-indicator rule.
== Special Rule Names ==
A handful of rule names are words inside angle brackets. These are called
"special rules". They are not defined in the grammar. Each of these rules is
an assertion that doesn't consume any text in the parse. They are:
* <empty>
Always matches.
* <start-of-line>
Current parse position is at the start of a line.
* <end-of-stream>
Current parse position is at the end of the input stream.
* <auto-detect-indent>
Detect the number of characters of new indentation.
== Rule Definitions ==
The rule definitions here are expressed as YAML data structures where the
scalar values can be:
* A rule name
* A literal character (always single quoted for clarity)
* Some special DSL forms
The various DSL forms were chosen carefully to be easy to read and
understand. The rest of this section describes those forms.
=== Special Syntax Used ===
There are some syntax conventions used for clarity. Everything is valid YAML.
Certain things could be expressed in YAML in different ways. For instance a
single character to be matched could be unquoted, single-quoted or
double-quoted. If the intent is to match a single character, it is single
quoted.
* unquoted-word -- Usually a rule name
* unquoted-single-char -- A variable name
* single-quoted-char -- A character to match
* double-quoted-string -- A literal string value
* (...) -- A parsing function
* <...> -- A special rule name
=== Hex Codes ===
Hex codes are scalars beginning with an `x` followed by an even number of
uppercase hexadecimal digits. Since hex codes represent characters they are
encoded in single quotes.
There are hex ranges as well, which consist of a pair of hex codes.
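A parser generator has to turn these hex codes into real characters at some
point. The Python sketch below shows one way to do that, assuming the form
described above (an `x` followed by an even number of uppercase hex digits)
and assuming that hex ranges are inclusive at both ends. The function names
are hypothetical and are not part of the grammar.

```python
import re

# "x" followed by one or more pairs of uppercase hexadecimal digits.
HEX_CODE = re.compile(r"x((?:[0-9A-F]{2})+)")


def hex_code_to_char(code):
    """Convert a hex code such as "x0A" to the character it names."""
    match = HEX_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a hex code: {code!r}")
    return chr(int(match.group(1), 16))


def in_hex_range(char, low, high):
    """Check whether char lies in the range given by a pair of hex codes.

    Assumption: both ends of the range are inclusive.
    """
    return hex_code_to_char(low) <= char <= hex_code_to_char(high)
```

For example, `hex_code_to_char("x0A")` gives a line feed, and
`in_hex_range("A", "x20", "x7E")` is true.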
=== Rule Functions ===
The grammar uses these functions. The function name is always in parentheses.
The name is the key of a single-pair mapping.
* (any)
One of the rules in the group must match.
* (all)
All of the rules in the group must match.
* (+++)
The rule must match 1 or more times.
* (***)
The rule must match 0 or more times.
* (???)
The rule must match 0 or 1 times.
* ({x})
The rule must match x times, where x is an integer or a variable.
* (===)
Positive look-ahead assertion.
* (!==)
Negative look-ahead assertion.
* (<==)
Positive look-behind assertion.
* (---)
A set of characters comprised of the first set minus each of the others.
* (...)
The argument variables that are passed to this rule. Either a single
value or a sequence of values.
* (case)
A variable indicated by `var` must match one of the other keys. The value
of the matching key pair is the next rule to be checked.
* (flip)
Some grammar rules, like 136, don't attempt to match anything. They are
functions that change a variable. Rule 136 uses the (flip) function to
change a variable from one value to another.
* (if)
Check the rule indicated by the value. If it matches, call the (set)
function that must be provided.
* (set)
The value is a 2 element sequence. The first element is a variable and
the second is an expression. Set the variable to the value of the
evaluated expression.
* (ord)
Convert a character (1-9) to an integer.
* (match)
Return the string or character value of the last rule match.
* (len)
Return the length of a string value.
* (max)
The YAML spec grammar constrains some rules to complete in the next 1024
characters. The (max) function is used to indicate that this is the
case. How it is implemented (like many things) is left as an exercise for
the parser author.
* (exclude)
This one is a doozy. It indicates a rule that must not match during any
of the recursive sub-matches. Since it is used near the top of the
grammar, it affects almost everything.
* (+)
Return the first argument plus the second.
* (-)
Return the first argument minus the second.
* (<)
First argument is less than the second.
* (<=)
First argument is less than or equal to the second.
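To make the grouping and repetition functions above more concrete, here is a
minimal Python sketch of how a matcher might interpret (any), (all), (+++),
(***) and (???). It makes several simplifying assumptions that are not taken
from the grammar: every string is treated as literal text to match (rule
names are not resolved), matching is greedy, and there is no backtracking
into earlier choices, which a real parser may need. The `match` function here
is hypothetical and unrelated to the (match) function listed above.

```python
def match(rule, text, pos):
    """Return the position after matching rule at pos, or None on failure."""
    if isinstance(rule, str):  # assumption: a string is literal text
        return pos + len(rule) if text.startswith(rule, pos) else None
    (name, arg), = rule.items()  # a function is a single-pair mapping
    if name == "(any)":  # first alternative that matches wins
        for sub in arg:
            end = match(sub, text, pos)
            if end is not None:
                return end
        return None
    if name == "(all)":  # every sub-rule must match, one after another
        for sub in arg:
            pos = match(sub, text, pos)
            if pos is None:
                return None
        return pos
    if name in ("(+++)", "(***)", "(???)"):
        count = 0
        while True:
            end = match(arg, text, pos)
            if end is None:
                break
            count += 1
            stalled = end == pos  # an empty match would loop forever
            pos = end
            if stalled or name == "(???)":
                break
        return None if name == "(+++)" and count == 0 else pos
    raise ValueError(f"unsupported function: {name}")


rule = {"(all)": ["a", {"(+++)": {"(any)": ["b", "c"]}}, {"(???)": "d"}]}
```

With this rule, `match(rule, "abcbd", 0)` consumes all five characters, while
`match(rule, "ad", 0)` fails because (+++) needs at least one `b` or `c`.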
'001': |
This is the complete set of characters allowed in a YAML stream.
The (any) means that any one of the value group's rules must match for the
group to match. The remaining rules are attempted in order.
The rules beginning with `x` are hex codes. The pairs of hex codes are
character ranges.
'003': |
Rules 003-021 define single characters. These rules are never referenced in
the rest of the grammar. They should at least be referenced in rule 022.
'004': |
Rules 004 to 021 are single characters to match. Instead of a hex code, they
are specified as a single character in single quotes.
'022': |
Rule 022 should probably be defined using the rule names above rather than
literal characters.
'026': |
Rules can "call" other rules by name. This is called a rule reference. A
reference matches if its named rule matches.
'027': |
The (---) operator here matches if the next character is in the first range,
but not in any of the following ranges.
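The subtraction described here can be sketched in a few lines of Python. The
helper names and the representation of a range (a single character, or an
inclusive pair of characters) are assumptions made for this illustration, not
something the grammar prescribes.

```python
def in_range(char, rng):
    """rng is a single character or an inclusive (low, high) pair."""
    if isinstance(rng, str):
        return char == rng
    low, high = rng
    return low <= char <= high


def minus(char, first, *others):
    """Sketch of (---): char is in `first` but in none of `others`."""
    return in_range(char, first) and not any(in_range(char, o) for o in others)
```

For instance, `minus("a", (" ", "~"), "#")` is true, while
`minus("#", (" ", "~"), "#")` is false because `#` is subtracted from the set.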
'028': |
A rule in a group of rules can be another group. This is how parenthesized
groups are represented.
The (all) function matches if all the rules in its group match.
'059': |
The ({x}) is a numeric quantifier, i.e. the rule must match that exact number
of times.
'063': |
Some rules are intended to be called with arguments. Arguments are specified
with the (...) key. A single argument is a scalar value. Multiple arguments
are specified as a sequence.
'064': |
This rule means to match as many spaces as possible, then check that the
length of the match is less than the indentation level in variable n. The
(<<<) means that if both assertions are not true, backtrack to where the rule
started.
The (***) function means the rule must match 0 or more times.
The (<) function matches if the first argument is less than the second.
The (len) function returns the length of a string, and the (match) function
returns the last match.
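The behaviour described for this rule can be sketched in Python as follows.
The function name and the convention of returning a new position (or None on
failure, leaving the caller at the starting position) are assumptions made
for the sketch.

```python
def indent_less_than(text, pos, n):
    """Match spaces at pos; succeed only if fewer than n were matched.

    Returns the position after the spaces, or None if the check fails,
    in which case the caller stays at pos (nothing is consumed).
    """
    end = pos
    while end < len(text) and text[end] == " ":  # (***) applied to a space
        end += 1
    matched = text[pos:end]                      # what (match) would return
    if len(matched) < n:                         # (<) applied to (len)
        return end
    return None                                  # backtrack to pos
```

With n = 3, two leading spaces match and three or more do not.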
'065': |
The (<=) function matches if the first argument is less than or equal to the
second.
'066': |
The (+++) function means the rule must match 1 or more times.
The <start-of-line> rule is a "special" rule. It's not defined in the
grammar. It's an assertion that the current parse position is on a "start of
line" boundary.
'067': |
A (case) statement is a mapping that chooses the next rule based on the value
of a state variable (in this case `c`).
7 rules in this grammar have (case) statements.
'069': |
The (???) function means the rule must match 0 or 1 times.
'076': |
The <end-of-stream> rule is a "special" rule that matches when there is no
more text to parse.
'105': |
The <empty> rule is a "special" rule that always matches (but consumes no
text). It is used for places in YAML that allow the absence of text to
represent an empty plain scalar.
'126': |
The (===) function is a positive lookahead. It checks to see if a character
is next, but doesn't consume it.
'136': |
The (flip) function uses a mapping to change an input value to another value,
and returns the new value.
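In code, (flip) amounts to a table lookup. The Python sketch below uses the
context transitions that the YAML 1.2 spec gives for rule 136; the table and
function names are hypothetical, and the table should be checked against the
grammar rather than taken as authoritative.

```python
# Context transitions for rule 136, as given in the YAML 1.2 spec.
IN_FLOW = {
    "flow-out": "flow-in",
    "flow-in": "flow-in",
    "block-key": "flow-key",
    "flow-key": "flow-key",
}


def flip(value, table):
    """Sketch of (flip): map an input value to its replacement value."""
    if value not in table:
        raise ValueError(f"no mapping for {value!r}")
    return table[value]
```

Here `flip("flow-out", IN_FLOW)` gives "flow-in", and nothing is consumed from
the input.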
'154': |
The (max) function tells the parser that it can consider only the given
number of characters to match the next rule.
'162': |
Productions 162-182 and 199 deal with folded and literal block scalars. These
productions are concerned with the variables:
m - relative indentation columns or 'auto-detect()'
t - trailing whitespace indicator
'183': |
The <auto-detect-indent> rule is a "special" rule that finds the new level of
indentation and returns the number of columns. The level must be greater than
0. It does not consume the indentation.
'207': |
The l-bare-document is the topmost rule for parsing a document.
The <excluding_c-forbidden_content> says that in every other rule you must
check that c-forbidden does not occur.
This is a bad design choice for a formal grammar.