3. How it works
At the end of the day, parsing an input against an Invisible XML grammar is really parsing that input against an Earley grammar constructed to recognize the same set of inputs (sometimes called “sentences”). When everything works, this is completely concealed. Unfortunately, when things go wrong, coffeepot can do little but show you the underlying grammar.
You can peek at this directly, even on a successful parse, with the
--show-grammar
option.
The Invisible XML grammar is much more directly expressive than the rules allowed by the Earley parser. The Earley parser accepts a set of rules that map nonterminals to a flat sequence of symbols (either terminals or nonterminals). The underlying grammar has no direct support for optionality or repeatability.
In order to support these “higher level” concepts, the parser rewrites the rules in terms of additional nonterminals. (Some of these rewrites are described in the Hints for Implementors section of the Invisible XML specification.)
Let’s look at some examples.
3.1 Mapping a simple rule
Simple rules, like a match for “aba”, map in a straightforward way.
s: a, b, a . { match "aba" }
a: 'a' .
b: 'b' .
The resulting Earley grammar has introduced a new nonterminal “$$” as the start symbol. This means that all of the grammars will have the same start symbol, irrespective of the users grammar.
$$ ⇒ s
s ⇒ a, b, a
a ⇒ 'a'
b ⇒ 'b'
3.2 Mapping alternatives
An Invisible XML grammar must have exactly one rule for each nonterminal. Alternatives must be expressed in that single rule:
s: a, b, a ; b, a, b . { match "aba" or "bab" }
a: 'a' .
b: 'b' .
In the Earley grammar, there are no alternatives, but there is no prohibition on multiple rules for the same nonterminal.
$$ ⇒ s
s ⇒ a, b, a
s ⇒ b, a, b
a ⇒ 'a'
b ⇒ 'b'
3.3 Mapping “x?”
Invisible XML uses “?” to express that a symbol is optional.
s: a, b?, a . { match "aba" or "aa" }
a: 'a' .
b: 'b' .
Earlier, I said the Earley grammar had no direct support for optionality.
That’s true. But the CoffeeGrinder
implementation does allow
a symbol to be marked as optional. This will ultimately get transformed into two
rules (s ⇒ a, b, a
and s ⇒ a, a
), it’s simply marked
with a question mark in the displayed grammar.
$$ ⇒ s
s ⇒ a, b?, a
a ⇒ 'a'
b ⇒ 'b'
3.4 Mapping “(x1, x2)?”
Making a group optional does require the introduction of a new nonterminal. The Earley parser doesn’t know anything about groups.
s: a, (b, a)? . { match "aba" or "a" }
a: 'a' .
b: 'b' .
Most nonterminals are just numeric. A few, like the rewrite for optionality, include some other description of their purpose. There’s no obvious way to provide meaningful names for the nonterminals introduced during rewriting. Attempting to decorate the names of the nonterminals in the original grammar produces much longer names which have drawbacks of their own.
$$ ⇒ s
s ⇒ a, $1_option?
a ⇒ 'a'
b ⇒ 'b'
$1_option ⇒ b, a
3.5 Mapping “x*”
Invisible XML uses “*” to express that a symbol may be repeated zero or more times.
s: a, b*, a . { match "aa", "aba", "abba", "abbba" ... }
a: 'a' .
b: 'b' .
This requires a couple of new nonterminals. The superscript “n”
next to $1
indicates that it is “nullable”. That is to say,
it is possible for it to match the empty string, sometimes called “ε”.
Pay particular attention to nullable nonterminals, either your own or ones introduced by rewriting. The most common source of ambiguity in a grammar is to allow more than one nonterminal at a particular location to map to ε.
$$ ⇒ s
s ⇒ a, $1ⁿ, a
a ⇒ 'a'
b ⇒ 'b'
$1ⁿ ⇒ $2?
$2 ⇒ b, $2?
3.6 Mapping “x+”
Invisible XML uses “+” to express that a symbol may be repeated one or more times.
s: a, b+, a . { match "aba", "abba", "abbba" ... }
a: 'a' .
b: 'b' .
This requires three new nonterminals.
$$ ⇒ s
s ⇒ a, $1, a
a ⇒ 'a'
b ⇒ 'b'
$1 ⇒ b, $2ⁿ
$2ⁿ ⇒ $3?
$3 ⇒ b, $2ⁿ
3.7 Mapping “x1**x2”
Invisible XML uses “**” between two symbols to indicate
that the first symbol may be repeated zero or more times, separated by
the second. For example, name*','
accepts zero or more names
(however they’re defined) separated by commas.
This is a clever syntactic convenience for authors.
s: a, b**'.', a . { match "aa", "aba", "ab.ba", "ab.b.ba" ... }
a: 'a' .
b: 'b' .
It’s a bit of a mess for the Earley grammar.
$$ ⇒ s
s ⇒ a, $1ⁿ, a
a ⇒ 'a'
b ⇒ 'b'
$1ⁿ ⇒ $2?
$2 ⇒ b, $3ⁿ
$3ⁿ ⇒ $4?
$4 ⇒ '.', b, $3ⁿ
3.8 Mapping “x1++x2”
Invisible XML uses “++” between two symbols to indicate that the first symbol may be repeated one or more times, separated by the second.
s: a, b++'.', a . { match "aba", "ab.ba", "ab.b.ba" ... }
a: 'a' .
b: 'b' .
Also a bit of a mess for the Earley grammar.
$$ ⇒ s
s ⇒ a, $1, a
a ⇒ 'a'
b ⇒ 'b'
$1 ⇒ b, $2ⁿ
$2ⁿ ⇒ $3?
$3 ⇒ '.', b, $3?
3.9 Mapping ¯\_(ツ)_/¯
Of course, the syntax allows these rules to be mixed more-or-less arbitrarily.
{ match "a", "aa", "aba", "ab-ba", "ab.ba" ... }
s: a+, (b**'-' ; b++'.')?, a* .
a: 'a' .
b: 'b' .
Consequently, real grammars tend to be quite messy. (Not that I’d call this a real grammar in any useful sense.)
$$ ⇒ s
s ⇒ $2, |$1?, $8ⁿ
a ⇒ 'a'
b ⇒ 'b'
|$1 ⇒ $10ⁿ
|$1 ⇒ $5
$2 ⇒ a, $3ⁿ
$3ⁿ ⇒ $4?
$4 ⇒ a, $3ⁿ
$5 ⇒ b, $6ⁿ
$6ⁿ ⇒ $7?
$7 ⇒ '.', b, $7?
$8ⁿ ⇒ $9?
$9 ⇒ a, $9?
$10ⁿ ⇒ $11?
$11 ⇒ b, $12ⁿ
$12ⁿ ⇒ $13?
$13 ⇒ '-', b, $12ⁿ
3.10 A word about “-”, “@”, and “^”
The marks added to the grammar that reflect how (or if) a symbol should be output have no effect on the rule rewriting. They’re handled as attributes on the symbols.
3.11 Insertions
T.B.D.