3. How it works

At the end of the day, parsing an input against an Invisible XML grammar is really parsing that input against an Earley grammar constructed to recognize the same set of inputs (sometimes called “sentences”). When everything works, this is completely concealed. Unfortunately, when things go wrong, coffeepot can do little but show you the underlying grammar.

You can peek at this directly, even on a successful parse, with the --show-grammar option.

The Invisible XML grammar is much more directly expressive than the rules allowed by the Earley parser. The Earley parser accepts a set of rules that map nonterminals to a flat sequence of symbols (either terminals or nonterminals). The underlying grammar has no direct support for optionality or repeatability.

In order to support these “higher level” concepts, the parser rewrites the rules in terms of additional nonterminals. (Some of these rewrites are described in the Hints for Implementors section of the Invisible XML specification.)

Let’s look at some examples.

3.1 Mapping a simple rule

Simple rules, like a match for “aba”, map in a straightforward way.

s: a, b, a .            { match "aba" }
a: 'a' .
b: 'b' .
Example 3.1.1 The Invisible XML grammar

The resulting Earley grammar has introduced a new nonterminal “$$” as the start symbol. This means that all of the grammars will have the same start symbol, irrespective of the users grammar.

$$ ⇒ s
s ⇒ a, b, a
a ⇒ 'a'
b ⇒ 'b'
Example 3.1.2 The Earley grammar

3.2 Mapping alternatives

An Invisible XML grammar must have exactly one rule for each nonterminal. Alternatives must be expressed in that single rule:

s: a, b, a ; b, a, b .  { match "aba" or "bab" }
a: 'a' .
b: 'b' .
Example 3.2.1 The Invisible XML grammar

In the Earley grammar, there are no alternatives, but there is no prohibition on multiple rules for the same nonterminal.

$$ ⇒ s
s ⇒ a, b, a
s ⇒ b, a, b
a ⇒ 'a'
b ⇒ 'b'
Example 3.2.2 The Earley grammar

3.3 Mapping “x?”

Invisible XML uses “?” to express that a symbol is optional.

s: a, b?, a .           { match "aba" or "aa" }
a: 'a' .
b: 'b' .
Example 3.3.1 The Invisible XML grammar

Earlier, I said the Earley grammar had no direct support for optionality. That’s true. But the CoffeeGrinder implementation does allow a symbol to be marked as optional. This will ultimately get transformed into two rules (s ⇒ a, b, a and s ⇒ a, a), it’s simply marked with a question mark in the displayed grammar.

$$ ⇒ s
s ⇒ a, b?, a
a ⇒ 'a'
b ⇒ 'b'
Example 3.3.2 The Earley grammar

3.4 Mapping “(x1, x2)?”

Making a group optional does require the introduction of a new nonterminal. The Earley parser doesn’t know anything about groups.

s: a, (b, a)? .         { match "aba" or "a" }
a: 'a' .
b: 'b' .
Example 3.4.1 The Invisible XML grammar

Most nonterminals are just numeric. A few, like the rewrite for optionality, include some other description of their purpose. There’s no obvious way to provide meaningful names for the nonterminals introduced during rewriting. Attempting to decorate the names of the nonterminals in the original grammar produces much longer names which have drawbacks of their own.

$$ ⇒ s
s ⇒ a, $1_option?
a ⇒ 'a'
b ⇒ 'b'
$1_option ⇒ b, a
Example 3.4.2 The Earley grammar

3.5 Mapping “x*”

Invisible XML uses “*” to express that a symbol may be repeated zero or more times.

s: a, b*, a .           { match "aa", "aba", "abba", "abbba" ... }
a: 'a' .
b: 'b' .
Example 3.5.1 The Invisible XML grammar

This requires a couple of new nonterminals. The superscript “n” next to $1 indicates that it is “nullable”. That is to say, it is possible for it to match the empty string, sometimes called “ε”.

Pay particular attention to nullable nonterminals, either your own or ones introduced by rewriting. The most common source of ambiguity in a grammar is to allow more than one nonterminal at a particular location to map to ε.

$$ ⇒ s
s ⇒ a, $1ⁿ, a
a ⇒ 'a'
b ⇒ 'b'
$1ⁿ ⇒ $2?
$2 ⇒ b, $2?
Example 3.5.2 The Earley grammar

3.6 Mapping “x+”

Invisible XML uses “+” to express that a symbol may be repeated one or more times.

s: a, b+, a .           { match "aba", "abba", "abbba" ... }
a: 'a' .
b: 'b' .
Example 3.6.1 The Invisible XML grammar

This requires three new nonterminals.

$$ ⇒ s
s ⇒ a, $1, a
a ⇒ 'a'
b ⇒ 'b'
$1 ⇒ b, $2ⁿ
$2ⁿ ⇒ $3?
$3 ⇒ b, $2ⁿ
Example 3.6.2 The Earley grammar

3.7 Mapping “x1**x2”

Invisible XML uses “**” between two symbols to indicate that the first symbol may be repeated zero or more times, separated by the second. For example, name*',' accepts zero or more names (however they’re defined) separated by commas.

This is a clever syntactic convenience for authors.

s: a, b**'.', a .        { match "aa", "aba", "ab.ba", "ab.b.ba" ... }
a: 'a' .
b: 'b' .
Example 3.7.1 The Invisible XML grammar

It’s a bit of a mess for the Earley grammar.

$$ ⇒ s
s ⇒ a, $1ⁿ, a
a ⇒ 'a'
b ⇒ 'b'
$1ⁿ ⇒ $2?
$2 ⇒ b, $3ⁿ
$3ⁿ ⇒ $4?
$4 ⇒ '.', b, $3ⁿ
Example 3.7.2 The Earley grammar

3.8 Mapping “x1++x2”

Invisible XML uses “++” between two symbols to indicate that the first symbol may be repeated one or more times, separated by the second.

s: a, b++'.', a .        { match "aba", "ab.ba", "ab.b.ba" ... }
a: 'a' .
b: 'b' .
Example 3.8.1 The Invisible XML grammar

Also a bit of a mess for the Earley grammar.

$$ ⇒ s
s ⇒ a, $1, a
a ⇒ 'a'
b ⇒ 'b'
$1 ⇒ b, $2ⁿ
$2ⁿ ⇒ $3?
$3 ⇒ '.', b, $3?
Example 3.8.2 The Earley grammar

3.9 Mapping ¯\_(ツ)_/¯

Of course, the syntax allows these rules to be mixed more-or-less arbitrarily.

{ match "a", "aa", "aba", "ab-ba", "ab.ba" ... }
s: a+, (b**'-' ; b++'.')?, a* .
a: 'a' .
b: 'b' .
Example 3.9.1 The Invisible XML grammar

Consequently, real grammars tend to be quite messy. (Not that I’d call this a real grammar in any useful sense.)

$$ ⇒ s
s ⇒ $2, |$1?, $8ⁿ
a ⇒ 'a'
b ⇒ 'b'
|$1 ⇒ $10ⁿ
|$1 ⇒ $5
$2 ⇒ a, $3ⁿ
$3ⁿ ⇒ $4?
$4 ⇒ a, $3ⁿ
$5 ⇒ b, $6ⁿ
$6ⁿ ⇒ $7?
$7 ⇒ '.', b, $7?
$8ⁿ ⇒ $9?
$9 ⇒ a, $9?
$10ⁿ ⇒ $11?
$11 ⇒ b, $12ⁿ
$12ⁿ ⇒ $13?
$13 ⇒ '-', b, $12ⁿ
Example 3.9.2 The Earley grammar

3.10 A word about “-”, “@”, and “^”

The marks added to the grammar that reflect how (or if) a symbol should be output have no effect on the rule rewriting. They’re handled as attributes on the symbols.

3.11 Insertions

T.B.D.